Metabolomics analysis of Papaya sensory profiles

Aims

The aim of this project was to determine the role of flavour related metabolites in papaya fruit flavour, focussing on taste and aroma. As demonstrated by Colantonio et al. (2022), metabolomic analysis of fruit can identify metabolite markers for fruit flavour selection in breeding programs. Subsequently, genetic markers may be developed, which are neccesary for papaya breeding due to the challenges associated with growing thousands of mature trees for phenotyping (Lomax et al. 2024). We aimed to build on the work reported by Zhou et al. (2022)

Objectives

  1. Examine the general differences in sensory and metabolite profiles between 27 papaya genotypes using principal component analysis.
  2. Apply machine learning models (GBM and BGLR) to analyse how specific metabolites contribute to aroma and flavor characteristics in papaya.
  3. Identify trends in consumer liking scores and consumer demographic information using a clustering algorithm (APcluster)
  4. Apply machine learning models (GBM and BGLR) to analyse the effect of sensory attributes and metabolite concentrations on papaya liking scores
  5. Generate a predictive machine learning model (XGBoost) to estimate fruit liking based on targeted metabolite concentrations

Methods

Sensory data was collected by a trained and untrained panel of tasters. The trained panel undertook standard quantitative descriptive analysis (QDA) for 27 papaya genotypes. The untrained panelists consisted of 125 consumers recruit for a “tropical fruit tasting”, who then scored nine disctinctly flavoured papaya genotypes using a labelled affective magnitude scale between 0 to 100 for liking.

Papaya fruit was collected from a single location (Dimbulah, QLD, AUS) on four separate occasions: August 2022 (QDA), February 2023 (QDA), January 2024 (Consumer) and February 2024 (QDA). The fruit was collected and treated aligned with industry management practices (e.g. ripening index 2-3, fungicide/ethylene, refrigerated truck transportation).

Chemical analysis of the fruit employed various methods, including traditional measurements for fruit quality (i.e. TSS, TA, pH, water content) and analytical methods (HS-SPME-GC-MS/MS, LC-MS/MS and IC-Au detector). To preserve the metabolite profile of the fruit after sensory preparation, the fruit was flash frozen and stored at -80. Further, the fruit subject to VOC analysis was prepared at temperatures below -80 using liquid nitrogen. Samples for LC and IC were freeze dried and subject to solvent extraction: IC, sugars (ethanol); LC, GTR (methanol).

Target VOCs were selected based on the concentration ranges reported Brewer et al. (2021), focusing on species that exceeded their odour threshold values in water (OAV > 1). The SPME optimisation and GC-MS conditions largely emulated Pico et al. (2022).

Data analysis preprocessing

The initial steps of the data analysis pipeline included:

  • Installing/loading required packages
# list of packages
required_packages <- c(
    "juba/rmdformats","kableExtra","gt","flextable","DT", # Document formatting & tables

    "tidyverse","readxl","janitor", # Data wrangling

    "ggpmisc","ggthemes","ggrepel","geomtextpath","scales", # Plotting stuff

    "FactoMineR","factoextra", # PCA & visualization

    "car","agricolae","tableone", # Stats & summaries

    "BGLR","caret","gbm","xgboost","apcluster","lme4","emmeans" # Modeling & ML
)

# Install all packages
# pak::pak(required_packages)

# load all packages
pacman::p_load(char = basename(required_packages), install=FALSE)
  • Compiling meta data
# Load metadata from Excel
genotype_table <- 
  read_excel("data/papaya_sample_meta_data.xlsx",
             sheet = "meta_data")

# Create sensory metadata table
sensory_meta <- 
  data.frame(
    "var" = c(
        # Aroma attributes
        "aroma_intensity_AR", "sweet_fruit_AR", "musty_off_note_AR",
        "fishy_AR", "citrus_AR", "floral_AR",
        # Texture attributes
        "resistance_TX", "velvety_TX", "juiciness_TX", "dissolving_TX",
        "fibrous_TX",
        # Flavor attributes
        "flavour_intensity_FL", "sweetness_FL", "bitterness_FL",
        "musty_FL", "floral_FL",
        # Aftertaste attributes
        "bitter_AT", "sweet_AT", "metallic_AT", "prickly_AT"),
    
    "group" = c(
        rep("aroma", 6),
        rep("texture", 5),
        rep("flavour", 5),
        rep("aftertaste", 4))) %>% 
  
   mutate(group = snakecase::to_title_case(group),
          label = snakecase::to_sentence_case(sub("_[A-Z]+$", "", var)))

# Create classification key for VOCs
class_key <- 
  read_excel("data/papaya_sample_meta_data.xlsx",
             sheet = "VOCs") %>%
  select(-(Class:Class2)) %>%
  rename("var" = "variable") %>%
  mutate(
    Class3 = str_replace_all(Class3,
                            c("Ester" = "Lipid-derived"))
  ) %>%
  rename("group" = "Class3") %>%
  mutate(
    label = str_replace_all(var,
                           c("Neryl_Geranyl_acetone" = "Neryl/Geranyl Acetone",
                             "E_2_nonal" = "E-2-Nonanal",
                             "P_cymene" = "P-Cymene",
                             "GTR_conc" = "GTR",
                             "wet fraction"="Water fraction",
                             "_" = " ")))

glimpse(genotype_table)
> Rows: 123
> Columns: 6
> $ ID        <chr> "01-050", "01-144", "01-217", "01-487", "02-081", "02-480", …
> $ label     <chr> "Eksotika", "Eksotika", "Eksotika", "Eksotika", "F1_Malay", …
> $ assay     <chr> "Feb24", "Feb24", "Feb24", "Feb24", "Feb24", "Feb24", "Feb24…
> $ Flesh     <chr> "Red", "Red", "Red", "Red", "Red", "Red", "Red", "Red", "Red…
> $ Genotype  <chr> "Eksotika", "Eksotika", "Eksotika", "Eksotika", "F1_Malay", …
> $ Genotype2 <chr> "Eksotika", "Eksotika", "Eksotika", "Eksotika", "F1 Malay", …
glimpse(sensory_meta)
> Rows: 20
> Columns: 3
> $ var   <chr> "aroma_intensity_AR", "sweet_fruit_AR", "musty_off_note_AR", "fi…
> $ group <chr> "Aroma", "Aroma", "Aroma", "Aroma", "Aroma", "Aroma", "Texture",…
> $ label <chr> "Aroma intensity", "Sweet fruit", "Musty off note", "Fishy", "Ci…
glimpse(class_key)
> Rows: 34
> Columns: 3
> $ var   <chr> "Hexanal", "P_cymene", "Terpinolene", "Sulcatone", "Beta_Cycloci…
> $ group <chr> "Lipid-derived", "Carotenoid/Terpenes", "Carotenoid/Terpenes", "
> $ label <chr> "Hexanal", "P-Cymene", "Terpinolene", "Sulcatone", "Beta Cycloci…
  • Calculating VOC matrix effects
calculate_matrix_effect <- function(data,col_range) {
  data %>%
    pivot_longer(
      cols = all_of(col_range),
      names_to = "species",
      values_to = "peak_area"
    ) %>%
    group_by(Spiked, type, species) %>%
    summarise(mean_peak_area = mean(peak_area), .groups = "drop") %>%
    mutate(sample = paste(Spiked, type, sep = "_")) %>%
    select(-c(Spiked, type)) %>%
    pivot_wider(
      names_from = "sample",
      values_from = "mean_peak_area"
    ) %>%
    mutate(
      red_matrix_effect = (((T_Red - F_Red)/F_Standard) - 1)*100,
      yellow_matrix_effect = (((T_Yellow - F_Yellow)/F_Standard) - 1)*100
    )
}

# Read and process data
matrix_effect_low <- read_xlsx("data/voc_matrix_effect.xlsx", "Matrix low split") %>% 
  filter(!ID %in% c("R8","Y8")) %>% 
  calculate_matrix_effect(col_range = 4:28)

matrix_effect_high <- read_xlsx("data/voc_matrix_effect.xlsx", "matrix high split") %>%
  calculate_matrix_effect(col_range = 4:7)

# Combine and clean species names
matrix_effect <- rbind(matrix_effect_high, matrix_effect_low) %>%
  mutate(species = str_replace_all(species, "-", "_"))

print(knitr::kable(matrix_effect[,c(1,7:8)],
      col.names = c("Flesh Color",
                    "Matrix effect red",
                    "Matrix effect yellow"),
      digits = 1,
      align = c('l', 'c', 'c'),
      caption = "matrix effect for VOC quantification"))
> 
> 
> Table: (\#tab:VOC Matrix Effect)matrix effect for VOC quantification
> 
> |Flesh Color                     | Matrix effect red | Matrix effect yellow |
> |:-------------------------------|:-----------------:|:--------------------:|
> |Cis_Linalool oxide              |      -196.4       |        -123.0        |
> |Linalool                        |       410.2       |        1583.0        |
> |Linalool_oxide                  |      -138.3       |        -110.3        |
> |Trans_Linalool oxide (furanoid) |       -85.2       |        -98.8         |
> |2_Octanone                      |       -46.3       |        -52.0         |
> |BITC                            |       -62.8       |        -71.7         |
> |Benzaldehyde                    |       -40.0       |        -36.8         |
> |Beta_Cyclocitral                |       -58.6       |        -55.2         |
> |Beta_ionone                     |       -85.8       |        -84.0         |
> |Decanal                         |       -58.4       |        -65.3         |
> |E_2_nonal                       |       -67.7       |        -70.2         |
> |Gamma_octalactone               |       -54.2       |        -17.9         |
> |Geraniol                        |       471.1       |        377.2         |
> |Geranylacetone                  |       -60.8       |        -72.1         |
> |Heptanal                        |       45.2        |        -36.6         |
> |Hexanal                         |        2.5        |        -60.9         |
> |Limonene                        |       -1.1        |        -52.6         |
> |Methyl_dodecanoate              |       -92.4       |        -95.0         |
> |Methyl_hexanoate                |       -61.7       |        -47.4         |
> |Methyl_octanoate                |       -36.7       |        -49.8         |
> |Neryl_Geranyl_acetone           |       -65.1       |        -73.2         |
> |Nerylacetone                    |       -72.5       |        -75.1         |
> |Nonanal                         |       -30.6       |        -59.2         |
> |Octanal                         |       -40.3       |        -62.4         |
> |P_cymene                        |       -54.9       |        -64.1         |
> |Phenylacetaldehyde              |       -39.1       |         -8.0         |
> |Sulcatone                       |       -7.9        |        -20.1         |
> |Terpinene                       |       -39.7       |        -66.8         |
> |Terpinolene                     |       -51.1       |        -56.9         |
  • Transforming VOC data to ng/kg
calibration_table <- read_excel("data/Results_22_23.xlsx", "densities") %>% 
  select(Hexanal:Linalool) %>% 
  t() %>% as.data.frame() %>% rownames_to_column("species") %>% 
  set_names(c("species", "densities", "OT", "purity"))

res_sheets <- c("Aug22", "Feb23", "Feb24")

VOCs <- res_sheets %>%
  # Read and combine sheets
  map(~read_excel("data/Results_22_23.xlsx", .x) %>%
        pivot_longer(cols = (Hexanal:BITC),
                     names_to = "species",
                     values_to = "concentration") %>%
        mutate(assay = .x,
               Genotype = as.factor(Genotype))) %>%
  list_rbind() %>%
  
  # Clean and join data
  mutate(Genotype = str_replace_all(Genotype, c("RB1_3" = "RB1", "Holland_5_3" = "Holland_5"))) %>%
  left_join(calibration_table) %>%
  left_join(genotype_table) %>%
  
  # Calculate concentrations
  mutate(
    conc_ppb = (concentration * densities * purity) * (0.00425/0.0042), # ug/L -> ug/KgFW = ug/L * volume(L)/mass(Kg)
    species = str_replace_all(species, "-", "_")) %>%
  
  # update matrix corrections
  left_join(
    matrix_effect %>% 
      mutate(across(
        ends_with("matrix_effect"),
        ~ case_when(
          species %in% c("Linalool","Linalool_oxide") ~ .[species == "Trans_Linalool oxide (furanoid)"],
          TRUE ~ .
          )
        ))
    ) %>%
  
   # Apply matrix corrections 
  mutate(
    corrected_conc_ppb = case_when(
      Flesh == "Red" ~ conc_ppb / (1 + (red_matrix_effect/100)),
      Flesh == "Yellow" ~ conc_ppb / (1 + (yellow_matrix_effect)/100)
      ),
    OAV = corrected_conc_ppb/OT,
    ) %>%
  relocate(c(assay, Flesh, label), .before = species)

# summary of VOC concentrations

VOC_summary <- VOCs %>%
  group_by(species) %>%
  summarise(
    n = n(),                                    
    mean_conc = mean(corrected_conc_ppb, na.rm = TRUE),
    sd_conc = sd(corrected_conc_ppb, na.rm = TRUE),      
    se_conc = sd_conc/sqrt(n),                 
    median_conc = median(corrected_conc_ppb, na.rm = TRUE),
    min_conc = min(corrected_conc_ppb, na.rm = TRUE),
    max_conc = max(corrected_conc_ppb, na.rm = TRUE),
    n_detected = sum(corrected_conc_ppb>0.01),        
    perc_detected = n_detected/n * 100         
  ) %>%
  arrange(desc(mean_conc)) %>%  
  mutate(across(where(is.numeric), ~ round(.x, digits = 2))) %>% 
  ungroup() 
  • Sugar extraction efficiency
sugar_key <- read_xlsx("data/July24_all.xlsx", sheet = "eth_key")
extraction_efficiency <-
  left_join(sugar_key, read_xlsx("data/July24_all.xlsx",
            sheet = "all_raw"), by = "sample_id") %>%
  filter(batch_id == "pooled") %>%
  mutate(
    Flesh = c(rep("Yellow", 5), rep("Red", 5)),
    Sorbitol_extraction_prop = Sorbitol/20000
  ) %>%
  group_by(Flesh) %>%
  summarise(
    extraction_efficiency_mean = mean(Sorbitol_extraction_prop*100),
    extraction_efficiency_sd = sd(Sorbitol_extraction_prop*100)
  )
print(knitr::kable(extraction_efficiency,
      col.names = c("Flesh Color",
                    "Mean Recovery (%)",
                    "SD (%)"),
      digits = 1,
      align = c('l', 'c', 'c'),
      caption = "Sorbitol extraction efficiency from freeze-dried tissue samples"))
> 
> 
> Table: (\#tab:Sugar extraction efficiency)Sorbitol extraction efficiency from freeze-dried tissue samples
> 
> |Flesh Color | Mean Recovery (%) | SD (%) |
> |:-----------|:-----------------:|:------:|
> |Red         |       84.4        |  0.6   |
> |Yellow      |       85.3        |  2.3   |
  • Sugar data unit transformation and imputation of missing values
bad_dry <- paste0("JL", 10:53) # compromised samples
UI <- paste0("UI", 1:5) # auxiliary samples

# Data import and initial cleaning
raw_sugar <- left_join(sugar_key,
                      read_xlsx("data/July24_all.xlsx", sheet = "all_raw"),
                      by = "sample_id") %>%
  filter(!sample_id %in% bad_dry,
         !sensory_id %in% UI,
         batch_id != "pooled",
         mass != "75") %>% # auxiliary sample
  select(-c(sample_id, Sorbitol, batch_id)) %>%
  rename(ID = sensory_id) %>%
  full_join(genotype_table, by = "ID") %>%
  left_join(read_xlsx("data/July24_all.xlsx", sheet = "weight"),
            by = "ID") %>%
  mutate(FW_factor = wet_fraction/dry_fraction)

# Missing values
na_rows <- raw_sugar[!complete.cases(raw_sugar),]
print(knitr::kable(na_rows,
                   caption = "Missing values in sugar dataset"))
> 
> 
> Table: (\#tab:Sugar_transformation)Missing values in sugar dataset
> 
> |ID     | mass|  Glucose| Fructose|   Sucrose|label     |assay |Flesh  |Genotype  |Genotype2   | Batch_id| dry_fraction| wet_fraction| FW_factor|
> |:------|----:|--------:|--------:|---------:|:---------|:-----|:------|:---------|:-----------|--------:|------------:|------------:|---------:|
> |01-105 |  100| 20923.45| 22637.53|        NA|ML1-26-12 |Aug22 |Yellow |ML1-26-12 |ML1-26-12   |        1|    0.1371277|    0.8628723|  6.292473|
> |01-1   |  100| 17030.04| 17982.28| 22326.106|ML1-26-12 |Aug22 |Yellow |ML1-26-12 |ML1-26-12   |       NA|           NA|           NA|        NA|
> |02-1   |  100| 21081.63| 20147.66| 16771.018|ML1-35-7  |Aug22 |Yellow |ML1-35-7  |ML1-35-7    |       NA|           NA|           NA|        NA|
> |03-1   |  100| 16326.40| 18285.30| 15781.072|ML2-8-17  |Aug22 |Yellow |ML2-8-17  |Moonlight 2 |       NA|           NA|           NA|        NA|
> |05-1   |  100| 19006.88| 19580.25| 19986.072|ML3-3-13  |Aug22 |Yellow |ML3-3-13  |ML3-3-13    |       NA|           NA|           NA|        NA|
> |07-1   |  101| 18744.57| 18958.28| 11242.871|1B        |Aug22 |Yellow |1B        |1B          |       NA|           NA|           NA|        NA|
> |02-654 |  101| 28661.26| 28256.75|  1213.175|ML1-35-7  |Aug22 |Yellow |ML1-35-7  |ML1-35-7    |       NA|           NA|           NA|        NA|
> |03-504 |  101| 28795.46| 30279.16|  1207.327|ML2-8-17  |Aug22 |Yellow |ML2-8-17  |Moonlight 2 |       NA|           NA|           NA|        NA|
> |05-603 |  100| 28863.26| 29848.19|        NA|ML3-3-13  |Aug22 |Yellow |ML3-3-13  |ML3-3-13    |        1|    0.1311861|    0.8688139|  6.622759|
> |02-021 |   NA|       NA|       NA|        NA|ML1-35-7  |Aug22 |Yellow |ML1-35-7  |ML1-35-7    |        1|    0.1350356|    0.8649644|  6.405455|
> |12-795 |   NA|       NA|       NA|        NA|Solo      |Aug22 |Red    |Solo      |Solo        |        1|    0.1160407|    0.8839593|  7.617667|

# Get all imputed datasets using default parameters
set.seed(11)

imp <- mice::mice(raw_sugar, m = 25, maxit = 100, print = FALSE)
> Warning: Number of logged events: 7

imputed_sugar <- complete(imp, "long") %>%
  group_by(ID) %>%
  summarise(across(c(mass, FW_factor, Glucose, Fructose, Sucrose), mean))

# new dataset with mean values from 25 random imputations after 100 iterations
final_sugar <- raw_sugar %>%
  select(-c(mass, FW_factor, Glucose, Fructose, Sucrose)) %>%
  left_join(imputed_sugar, by = "ID")

# Unit transformation of raw data to g sugar/100g FW
sugar_vars <- c("Glucose", "Fructose", "Sucrose")

sugar_data <- final_sugar %>%
  mutate(across(all_of(sugar_vars), ~.x * 0.001),  # Accounts for 0.001L extraction volume
         FW = mass * FW_factor,                      # Converts to fresh weight basis
         across(all_of(sugar_vars), ~(.x / FW) * 100),  # Convert to g/100g FW
         total_sugar = rowSums(across(all_of(sugar_vars))),
         across(all_of(sugar_vars), ~.x / total_sugar, .names = "{.col}_prop"),
         wet_fraction = FW_factor/(1+FW_factor),
         dry_fraction = 1-wet_fraction) %>%
  select(assay, Genotype, ID, Flesh, all_of(sugar_vars),
         total_sugar, ends_with("_prop"),label,wet_fraction)

glimpse(sugar_data)
> Rows: 123
> Columns: 13
> $ assay         <chr> "Feb23", "Feb23", "Feb23", "Feb23", "Feb23", "Feb23", "F…
> $ Genotype      <chr> "T2-6-5.27.30", "T2-6-3.21.28", "T2-6-5.27.12", "T2-6-3.
> $ ID            <chr> "01-618", "02-424", "03-849", "04-924", "05-mon", "06-46…
> $ Flesh         <chr> "Red", "Red", "Red", "Red", "Red", "Red", "Red", "Red", …
> $ Glucose       <dbl> 2.388368, 2.819370, 2.166258, 2.116345, 2.759456, 3.0948…
> $ Fructose      <dbl> 2.177387, 2.402008, 2.156498, 1.926594, 2.610099, 2.7535…
> $ Sucrose       <dbl> 7.654067, 7.026162, 6.312745, 4.587775, 6.481786, 3.1513…
> $ total_sugar   <dbl> 12.219822, 12.247540, 10.635501, 8.630714, 11.851341, 8.…
> $ Glucose_prop  <dbl> 0.1954503, 0.2301989, 0.2036818, 0.2452109, 0.2328391, 0…
> $ Fructose_prop <dbl> 0.1781848, 0.1961216, 0.2027641, 0.2232254, 0.2202366, 0…
> $ Sucrose_prop  <dbl> 0.6263649, 0.5736795, 0.5935541, 0.5315638, 0.5469243, 0…
> $ label         <chr> "T2-6-5.27.30", "T2-6-3.21.28", "T2-6-5.27.12", "T2-6-3.
> $ wet_fraction  <dbl> 0.8515992, 0.8561540, 0.8649311, 0.8783233, 0.8591738, 0
  • Glucotropaeolin unit transformation and QC analysis
# Read sample metadata and define sample IDs
GTR_key <- read_xlsx("data/July24_all.xlsx", sheet = "meth_key")
samples <- paste0("ME", seq(10, 136))

# Process GTR (Glucotropaeolin) concentration data
# Initial concentration in mg/L (mg glucotropaeolin per L methanol extract)
GTR_data <- read_xlsx("data/GTR_data.xlsx", sheet = "GTR") %>%
  # Filter relevant samples
  filter(sample_id %in% samples) %>%

  # Handle non-detects: replace 'n.a.' with detection limit (0.0001 mg/L)
  mutate(
    conc = str_replace(conc, "n.a.", "0.0001"),
    conc = as.numeric(conc)
  ) %>%

  # Reassign sample IDs and remove duplicate
  mutate(sample_id = paste0("ME", seq(10, 135))) %>%
  filter(!sample_id %in% c("ME12","ME57")) %>%

  # Join with sample metadata and moisture content data
  left_join(GTR_key) %>%
  rename(ID = sensory_id) %>%
  left_join(imputed_sugar %>% select(ID, FW_factor)) %>%

  # Convert concentration from mg/L to mg/kgFW
  # Step 1: mg/L * (0.001L/mass_mg) = mg/mgDW
  # Step 2: mg/mgDW / FW_factor = mg/mgFW
  # Step 3: mg/mgFW * 100000 = mg/kgFW
  mutate(
    conc = ((conc * (0.001/mass))/FW_factor) * 100000
  ) %>%

  # Clean up column names and select final variables
  rename(GTR_conc = conc) %>%
  select(ID, GTR_conc) %>% 
  filter(!ID %in% UI)

# Calculate Quality Control metrics for Continuing Calibration Verification (CCV) samples
# Reference standard (1 mg/L) analyzed every 10 samples
CCV_samples <- read_xlsx("data/GTR_data.xlsx", sheet = "GTR") %>%
  mutate(conc = as.numeric(conc)) %>%
  filter(sample_id == "STD_1ppm") %>%
  summarise(
    mean_conc = mean(conc),
    sd_conc = sd(conc),
    rsd_percent = (sd_conc/mean_conc) * 100
  )

# Display QC results
print(knitr::kable(CCV_samples,
                   caption = "Quality Control Metrics for 1 mg/L Reference Standard (n=14)",
                   col.names = c("Mean (mg/L)", "SD (mg/L)", "RSD (%)"),
                   digits = c(4, 4, 2)))
> 
> 
> Table: (\#tab:GTR data)Quality Control Metrics for 1 mg/L Reference Standard (n=14)
> 
> | Mean (mg/L)| SD (mg/L)| RSD (%)|
> |-----------:|---------:|-------:|
> |      1.1513|    0.1589|    13.8|
  • Combining data and imputing missing data
# Load proxy sugar/acid data, excluding redundant Genotype column
proxy <- read_xlsx("data/22_23_proxy_sugar.xlsx", "all") %>% select(-Genotype)

# Combine multiple datasets
conc_matrix <- full_join(sugar_data, GTR_data, relationship = "many-to-many") %>%  
  full_join(proxy, relationship = "many-to-many") %>% 
  full_join(VOCs, relationship = "many-to-many") %>% 
  # Remove auxiliary samples
  filter(!ID %in% c("01-1","02-1","03-1","04-1","05-1","07-1")) %>% 
  # Remove intermediate calculation columns
  select(-c(concentration:purity,conc_ppb:yellow_matrix_effect,OAV)) %>% 
  # Reshape data for analysis
  pivot_wider(names_from = species, 
              values_from = corrected_conc_ppb)

# missing results
print(knitr::kable(conc_matrix[!complete.cases(conc_matrix),],
                   caption = "Results with missing values",
                   digits = 1))
> 
> 
> Table: (\#tab:combine data)Results with missing values
> 
> |assay |Genotype  |ID     |Flesh  | Glucose| Fructose| Sucrose| total_sugar| Glucose_prop| Fructose_prop| Sucrose_prop|label       | wet_fraction| GTR_conc|  pH| Brix|  TA|Genotype2 | Hexanal| P_cymene| Terpinolene| Sulcatone| Linalool_oxide| Linalool| Beta_Cyclocitral| Terpinene| Heptanal| Methyl_hexanoate| Limonene| Octanal| Methyl_octanoate| Nonanal| Decanal| Benzaldehyde| E_2_nonal| Phenylacetaldehyde| Neryl_Geranyl_acetone| Gamma_octalactone| Beta_ionone| BITC|
> |:-----|:---------|:------|:------|-------:|--------:|-------:|-----------:|------------:|-------------:|------------:|:-----------|------------:|--------:|---:|----:|---:|:---------|-------:|--------:|-----------:|---------:|--------------:|--------:|----------------:|---------:|--------:|----------------:|--------:|-------:|----------------:|-------:|-------:|------------:|---------:|------------------:|---------------------:|-----------------:|-----------:|----:|
> |Feb24 |F1_Malay  |02-734 |Red    |     3.8|      3.6|     4.2|        11.6|          0.3|           0.3|          0.4|F1_Malay    |          0.9|       NA| 5.5| 13.4| 0.2|F1 Malay  |     2.1|      7.8|        68.1|      66.4|        16913.1|   5811.8|              2.9|      22.7|     10.5|              0.1|     62.7|     1.5|              1.6|     1.8|     3.6|          6.9|       1.5|                  0|                  10.6|              10.6|         5.2|  1.3|
> |Aug22 |Holland_5 |09-1   |Red    |     2.7|      3.0|     2.1|         7.9|          0.3|           0.4|          0.3|Holland_5_1 |          0.9|      0.2|  NA|   NA|  NA|Holland 5 |    16.2|      4.5|         0.0|      32.6|            0.0|     54.8|              0.0|       0.0|     34.2|              0.0|     10.5|     7.9|              0.5|     9.4|     0.0|          5.1|       0.0|                  0|                  35.1|               4.9|         4.5|  4.6|
> |Aug22 |Solo      |12-1   |Red    |     3.2|      3.4|     2.0|         8.7|          0.4|           0.4|          0.2|Solo        |          0.9|      0.1|  NA|   NA|  NA|Solo      |    19.5|      4.7|        15.1|      72.5|          670.6|    178.7|              0.0|       5.9|     33.8|              3.7|     14.0|     8.4|              0.6|    11.4|     0.0|          6.6|       0.0|                  0|                  76.1|               2.4|         6.7|  0.0|
> |Aug22 |ML3-3-13  |05-575 |Yellow |     5.5|      6.0|     0.1|        11.7|          0.5|           0.5|          0.0|ML3-3-13    |          0.8|       NA| 5.4| 12.3| 0.2|ML3-3-13  |    85.3|      5.9|        16.6|       2.8|         3593.0|    387.3|              0.0|       0.0|     85.2|              0.0|     22.1|    14.3|              0.0|    19.5|     0.0|          7.9|       0.0|                  0|                  32.8|               1.2|         4.7|  6.2|
> |Aug22 |ML1-35-7  |02-021 |Yellow |     3.4|      3.5|     2.8|         9.7|          0.3|           0.4|          0.3|ML1-35-7    |          0.9|       NA| 5.1| 10.6| 0.4|ML1-35-7  |    43.9|      6.4|        19.0|       4.6|       335918.5|   8204.7|              0.0|      10.7|     68.3|             48.5|     34.9|    12.2|              4.0|    14.3|     0.0|         24.8|       0.0|                  0|                  59.9|               9.0|         7.6|  8.1|
> |Aug22 |Solo      |12-795 |Red    |     2.9|      3.0|     1.8|         7.7|          0.4|           0.4|          0.2|Solo        |          0.9|       NA| 5.4| 10.4| 0.4|Solo      |    13.0|      7.2|        32.9|      15.0|         6139.8|   2893.9|              0.0|      23.2|     57.6|              2.8|     44.0|    17.5|              0.5|    19.6|     0.0|         11.7|       0.0|                  0|                 168.5|              41.5|         4.8| 25.2|

# Get all imputed datasets using default parameters
imp_vars <- c("GTR_conc","pH","Brix","TA")

set.seed(11)
init <- mice::mice(conc_matrix,maxit=0)
> Warning: Number of logged events: 6
init$predictorMatrix[,] <- 0 
init$predictorMatrix[imp_vars, c(sugar_vars,"BITC")] <- 1

imp2 <- mice::mice(conc_matrix, predictorMatrix = init$predictorMatrix, m = 25, maxit = 100, print = FALSE)

imputed_conc <- complete(imp2, "long") %>%
  group_by(ID) %>%
  summarise(across(all_of(imp_vars), mean))

# new dataset with mean values from 25 random imputations after 100 iterations (probably overkill)
final_conc_matrix <- conc_matrix %>%
  select(-(all_of(imp_vars))) %>%
  left_join(imputed_conc, by = "ID") %>% 
  relocate(all_of(imp_vars), .before = Hexanal)

# Attaching sensory data
full_data_matrix <- final_conc_matrix %>% left_join(read_xlsx("data/all_sensory_data.xlsx"))

# full_data_matrix %>% select(Flesh, sweetness_FL, Sulcatone, Heptanal) %>% print(n=118)
  
data_summary_genotype <- 
  full_data_matrix %>%
  group_by(Genotype) %>%
  summarise(across(where(is.numeric),
                   ~sprintf("%.2f (%.2f)", mean(.x), sd(.x)))) %>%
  pivot_longer(cols = -Genotype, names_to = "variable") %>%
  pivot_wider(names_from = Genotype, values_from = value)
  

Sensory and metabolite data sumary

print(summarytools::dfSummary(full_data_matrix,
                graph.magnif = 0.82,
                varnumbers = FALSE,
                style = "grid",
                valid.col = TRUE,  # Changed to TRUE to match your desired output
                graph.col = TRUE),  # Added to ensure graphs are displayed
      method = 'render',  # Important for proper HTML rendering
      table.classes = 'st-table st-table-bordered st-table-striped')

Data Frame Summary

full_data_matrix

Dimensions: 118 x 60
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
assay [character]
1. Aug22
2. Feb23
3. Feb24
46(39.0%)
24(20.3%)
48(40.7%)
118 (100.0%) 0 (0.0%)
Genotype [character]
1. RB1
2. Holland_5
3. ML3-3-13
4. 1B
5. C2-6-5.15.2
6. C2-6-5.5.14
7. C2-6-5.9.1
8. Eksotika
9. F1_Malay
10. First_Lady
[ 17 others ]
11(9.3%)
8(6.8%)
5(4.2%)
4(3.4%)
4(3.4%)
4(3.4%)
4(3.4%)
4(3.4%)
4(3.4%)
4(3.4%)
66(55.9%)
118 (100.0%) 0 (0.0%)
ID [character]
1. 01-039
2. 01-050
3. 01-105
4. 01-144
5. 01-215
6. 01-217
7. 01-487
8. 01-534
9. 01-618
10. 01-776
[ 108 others ]
1(0.8%)
1(0.8%)
1(0.8%)
1(0.8%)
1(0.8%)
1(0.8%)
1(0.8%)
1(0.8%)
1(0.8%)
1(0.8%)
108(91.5%)
118 (100.0%) 0 (0.0%)
Flesh [character]
1. Red
2. Yellow
90(76.3%)
28(23.7%)
118 (100.0%) 0 (0.0%)
Glucose [numeric]
Mean (sd) : 3 (0.8)
min ≤ med ≤ max:
1.3 ≤ 2.8 ≤ 5.5
IQR (CV) : 1 (0.3)
118 distinct values 118 (100.0%) 0 (0.0%)
Fructose [numeric]
Mean (sd) : 2.9 (0.8)
min ≤ med ≤ max:
1.4 ≤ 2.7 ≤ 6
IQR (CV) : 1.1 (0.3)
118 distinct values 118 (100.0%) 0 (0.0%)
Sucrose [numeric]
Mean (sd) : 3.3 (2.3)
min ≤ med ≤ max:
0.1 ≤ 2.7 ≤ 8.9
IQR (CV) : 3.7 (0.7)
118 distinct values 118 (100.0%) 0 (0.0%)
total_sugar [numeric]
Mean (sd) : 9.2 (2.1)
min ≤ med ≤ max:
4.2 ≤ 9.2 ≤ 13.6
IQR (CV) : 3.2 (0.2)
118 distinct values 118 (100.0%) 0 (0.0%)
Glucose_prop [numeric]
Mean (sd) : 0.3 (0.1)
min ≤ med ≤ max:
0.2 ≤ 0.3 ≤ 0.5
IQR (CV) : 0.1 (0.3)
118 distinct values 118 (100.0%) 0 (0.0%)
Fructose_prop [numeric]
Mean (sd) : 0.3 (0.1)
min ≤ med ≤ max:
0.1 ≤ 0.3 ≤ 0.5
IQR (CV) : 0.2 (0.3)
118 distinct values 118 (100.0%) 0 (0.0%)
Sucrose_prop [numeric]
Mean (sd) : 0.3 (0.2)
min ≤ med ≤ max:
0 ≤ 0.3 ≤ 0.7
IQR (CV) : 0.3 (0.6)
118 distinct values 118 (100.0%) 0 (0.0%)
label [character]
1. ML3-3-13
2. 1B
3. C2-6-5.15.2
4. C2-6-5.5.14
5. C2-6-5.9.1
6. Eksotika
7. F1_Malay
8. First_Lady
9. Holland_5_1
10. Holland_5_3
[ 20 others ]
5(4.2%)
4(3.4%)
4(3.4%)
4(3.4%)
4(3.4%)
4(3.4%)
4(3.4%)
4(3.4%)
4(3.4%)
4(3.4%)
77(65.3%)
118 (100.0%) 0 (0.0%)
wet_fraction [numeric]
Mean (sd) : 0.9 (0)
min ≤ med ≤ max:
0.8 ≤ 0.9 ≤ 0.9
IQR (CV) : 0 (0)
118 distinct values 118 (100.0%) 0 (0.0%)
Genotype2 [character]
1. RB1
2. Holland 5
3. ML3-3-13
4. 1B
5. C2-6-5.15.2
6. C2-6-5.5.14
7. C2-6-5.9.1
8. Eksotika
9. F1 Malay
10. First Lady
[ 17 others ]
11(9.3%)
8(6.8%)
5(4.2%)
4(3.4%)
4(3.4%)
4(3.4%)
4(3.4%)
4(3.4%)
4(3.4%)
4(3.4%)
66(55.9%)
118 (100.0%) 0 (0.0%)
GTR_conc [numeric]
Mean (sd) : 0.8 (1.5)
min ≤ med ≤ max:
0 ≤ 0.3 ≤ 12.1
IQR (CV) : 0.9 (1.9)
118 distinct values 118 (100.0%) 0 (0.0%)
pH [numeric]
Mean (sd) : 5.5 (0.2)
min ≤ med ≤ max:
5 ≤ 5.5 ≤ 6.1
IQR (CV) : 0.2 (0)
14 distinct values 118 (100.0%) 0 (0.0%)
Brix [numeric]
Mean (sd) : 11.3 (1.8)
min ≤ med ≤ max:
6.6 ≤ 11.1 ≤ 14.2
IQR (CV) : 2.8 (0.2)
59 distinct values 118 (100.0%) 0 (0.0%)
TA [numeric]
Mean (sd) : 0.2 (0.1)
min ≤ med ≤ max:
0 ≤ 0.2 ≤ 0.4
IQR (CV) : 0.1 (0.4)
34 distinct values 118 (100.0%) 0 (0.0%)
Hexanal [numeric]
Mean (sd) : 18.4 (21.3)
min ≤ med ≤ max:
0 ≤ 9.8 ≤ 85.3
IQR (CV) : 22.8 (1.2)
118 distinct values 118 (100.0%) 0 (0.0%)
P_cymene [numeric]
Mean (sd) : 5.2 (3.3)
min ≤ med ≤ max:
0 ≤ 5.1 ≤ 20.5
IQR (CV) : 4.2 (0.6)
113 distinct values 118 (100.0%) 0 (0.0%)
Terpinolene [numeric]
Mean (sd) : 20.9 (33.9)
min ≤ med ≤ max:
0 ≤ 14.7 ≤ 213.3
IQR (CV) : 16 (1.6)
96 distinct values 118 (100.0%) 0 (0.0%)
Sulcatone [numeric]
Mean (sd) : 62.9 (84.2)
min ≤ med ≤ max:
0 ≤ 46 ≤ 629.1
IQR (CV) : 66.4 (1.3)
118 distinct values 118 (100.0%) 0 (0.0%)
Linalool_oxide [numeric]
Mean (sd) : 52682.5 (108929.3)
min ≤ med ≤ max:
0 ≤ 7529 ≤ 747850.9
IQR (CV) : 42821.2 (2.1)
117 distinct values 118 (100.0%) 0 (0.0%)
Linalool [numeric]
Mean (sd) : 7313.7 (18543.3)
min ≤ med ≤ max:
0 ≤ 682.3 ≤ 112060.6
IQR (CV) : 4922.4 (2.5)
117 distinct values 118 (100.0%) 0 (0.0%)
Beta_Cyclocitral [numeric]
Mean (sd) : 2.2 (2.9)
min ≤ med ≤ max:
0 ≤ 0 ≤ 10
IQR (CV) : 5 (1.3)
50 distinct values 118 (100.0%) 0 (0.0%)
Terpinene [numeric]
Mean (sd) : 10.5 (15.4)
min ≤ med ≤ max:
0 ≤ 5.9 ≤ 105.5
IQR (CV) : 9.5 (1.5)
107 distinct values 118 (100.0%) 0 (0.0%)
Heptanal [numeric]
Mean (sd) : 32 (24.5)
min ≤ med ≤ max:
0 ≤ 23.4 ≤ 112.2
IQR (CV) : 30.4 (0.8)
118 distinct values 118 (100.0%) 0 (0.0%)
Methyl_hexanoate [numeric]
Mean (sd) : 42.8 (199.1)
min ≤ med ≤ max:
0 ≤ 0 ≤ 2002.8
IQR (CV) : 8.8 (4.6)
52 distinct values 118 (100.0%) 0 (0.0%)
Limonene [numeric]
Mean (sd) : 26.4 (39.9)
min ≤ med ≤ max:
0 ≤ 12.1 ≤ 250.1
IQR (CV) : 30.3 (1.5)
90 distinct values 118 (100.0%) 0 (0.0%)
Octanal [numeric]
Mean (sd) : 7 (5.9)
min ≤ med ≤ max:
0 ≤ 5.2 ≤ 33.5
IQR (CV) : 8.4 (0.8)
118 distinct values 118 (100.0%) 0 (0.0%)
Methyl_octanoate [numeric]
Mean (sd) : 2.2 (3.8)
min ≤ med ≤ max:
0 ≤ 1.4 ≤ 20.9
IQR (CV) : 1.1 (1.7)
101 distinct values 118 (100.0%) 0 (0.0%)
Nonanal [numeric]
Mean (sd) : 9.5 (7.9)
min ≤ med ≤ max:
0.4 ≤ 8.1 ≤ 36.4
IQR (CV) : 13.7 (0.8)
118 distinct values 118 (100.0%) 0 (0.0%)
Decanal [numeric]
Mean (sd) : 0.4 (1.2)
min ≤ med ≤ max:
0 ≤ 0 ≤ 7.8
IQR (CV) : 0 (3)
20 distinct values 118 (100.0%) 0 (0.0%)
Benzaldehyde [numeric]
Mean (sd) : 14.1 (14.7)
min ≤ med ≤ max:
1.8 ≤ 7.8 ≤ 74.8
IQR (CV) : 15.6 (1)
118 distinct values 118 (100.0%) 0 (0.0%)
E_2_nonal [numeric]
Mean (sd) : 0.5 (0.9)
min ≤ med ≤ max:
0 ≤ 0 ≤ 5.1
IQR (CV) : 0.8 (1.7)
45 distinct values 118 (100.0%) 0 (0.0%)
Phenylacetaldehyde [numeric]
Mean (sd) : 0 (0)
min ≤ med ≤ max:
0 ≤ 0 ≤ 0
IQR (CV) : 0 (1)
3 distinct values 118 (100.0%) 0 (0.0%)
Neryl_Geranyl_acetone [numeric]
Mean (sd) : 39.9 (42.8)
min ≤ med ≤ max:
2.6 ≤ 32.9 ≤ 312.7
IQR (CV) : 53.9 (1.1)
118 distinct values 118 (100.0%) 0 (0.0%)
Gamma_octalactone [numeric]
Mean (sd) : 11.4 (13.7)
min ≤ med ≤ max:
1.2 ≤ 6.4 ≤ 63.2
IQR (CV) : 9.1 (1.2)
118 distinct values 118 (100.0%) 0 (0.0%)
Beta_ionone [numeric]
Mean (sd) : 6.8 (4.9)
min ≤ med ≤ max:
3.5 ≤ 5.3 ≤ 50.2
IQR (CV) : 3 (0.7)
118 distinct values 118 (100.0%) 0 (0.0%)
BITC [numeric]
Mean (sd) : 4.5 (7.1)
min ≤ med ≤ max:
0 ≤ 2.6 ≤ 67.3
IQR (CV) : 5.3 (1.6)
93 distinct values 118 (100.0%) 0 (0.0%)
aroma_intensity_AR [numeric]
Mean (sd) : 52.6 (16.7)
min ≤ med ≤ max:
18.8 ≤ 49.4 ≤ 86
IQR (CV) : 23.9 (0.3)
113 distinct values 118 (100.0%) 0 (0.0%)
sweet_fruit_AR [numeric]
Mean (sd) : 35.3 (18.6)
min ≤ med ≤ max:
5.4 ≤ 31.6 ≤ 71.1
IQR (CV) : 33.2 (0.5)
116 distinct values 118 (100.0%) 0 (0.0%)
musty_off_note_AR [numeric]
Mean (sd) : 40.4 (12.3)
min ≤ med ≤ max:
13.1 ≤ 39.3 ≤ 74.8
IQR (CV) : 16.2 (0.3)
112 distinct values 118 (100.0%) 0 (0.0%)
fishy_AR [numeric]
Mean (sd) : 30.2 (16.5)
min ≤ med ≤ max:
0.6 ≤ 29 ≤ 68.9
IQR (CV) : 27 (0.5)
112 distinct values 118 (100.0%) 0 (0.0%)
citrus_AR [numeric]
Mean (sd) : 21.2 (18.9)
min ≤ med ≤ max:
0.9 ≤ 14.2 ≤ 71.4
IQR (CV) : 26.1 (0.9)
108 distinct values 118 (100.0%) 0 (0.0%)
floral_AR [numeric]
Mean (sd) : 21.6 (19.2)
min ≤ med ≤ max:
0.3 ≤ 13.9 ≤ 73.7
IQR (CV) : 22.3 (0.9)
109 distinct values 118 (100.0%) 0 (0.0%)
resistance_TX [numeric]
Mean (sd) : 42.1 (22.7)
min ≤ med ≤ max:
3.4 ≤ 40.9 ≤ 89.7
IQR (CV) : 35.8 (0.5)
115 distinct values 118 (100.0%) 0 (0.0%)
velvety_TX [numeric]
Mean (sd) : 45.4 (13.1)
min ≤ med ≤ max:
16.8 ≤ 44.5 ≤ 75.3
IQR (CV) : 17.5 (0.3)
113 distinct values 118 (100.0%) 0 (0.0%)
juiciness_TX [numeric]
Mean (sd) : 48.4 (19.4)
min ≤ med ≤ max:
6.8 ≤ 50.4 ≤ 85.4
IQR (CV) : 34.6 (0.4)
117 distinct values 118 (100.0%) 0 (0.0%)
dissolving_TX [numeric]
Mean (sd) : 54.9 (21.3)
min ≤ med ≤ max:
6.9 ≤ 56.4 ≤ 92
IQR (CV) : 28.8 (0.4)
117 distinct values 118 (100.0%) 0 (0.0%)
fibrous_TX [numeric]
Mean (sd) : 44.6 (13.5)
min ≤ med ≤ max:
15.9 ≤ 44.3 ≤ 84.6
IQR (CV) : 17.9 (0.3)
111 distinct values 118 (100.0%) 0 (0.0%)
flavour_intensity_FL [numeric]
Mean (sd) : 58.2 (13.7)
min ≤ med ≤ max:
14.1 ≤ 59 ≤ 87.2
IQR (CV) : 19.4 (0.2)
109 distinct values 118 (100.0%) 0 (0.0%)
sweetness_FL [numeric]
Mean (sd) : 54.8 (16.6)
min ≤ med ≤ max:
7 ≤ 57 ≤ 80.8
IQR (CV) : 22.7 (0.3)
114 distinct values 118 (100.0%) 0 (0.0%)
bitterness_FL [numeric]
Mean (sd) : 26.8 (18.7)
min ≤ med ≤ max:
2.4 ≤ 20.8 ≤ 96.2
IQR (CV) : 19.8 (0.7)
109 distinct values 118 (100.0%) 0 (0.0%)
musty_FL [numeric]
Mean (sd) : 42.1 (11.5)
min ≤ med ≤ max:
18.2 ≤ 39.7 ≤ 69.6
IQR (CV) : 16.8 (0.3)
112 distinct values 118 (100.0%) 0 (0.0%)
floral_FL [numeric]
Mean (sd) : 26.6 (16.4)
min ≤ med ≤ max:
1.9 ≤ 21.2 ≤ 66.6
IQR (CV) : 20.4 (0.6)
114 distinct values 118 (100.0%) 0 (0.0%)
bitter_AT [numeric]
Mean (sd) : 23.6 (17)
min ≤ med ≤ max:
5 ≤ 18.1 ≤ 91.1
IQR (CV) : 16.5 (0.7)
108 distinct values 118 (100.0%) 0 (0.0%)
sweet_AT [numeric]
Mean (sd) : 37.8 (15.5)
min ≤ med ≤ max:
1.5 ≤ 39.4 ≤ 66.9
IQR (CV) : 21 (0.4)
116 distinct values 118 (100.0%) 0 (0.0%)
metallic_AT [numeric]
Mean (sd) : 11.8 (6.1)
min ≤ med ≤ max:
0.8 ≤ 10.7 ≤ 34.7
IQR (CV) : 7 (0.5)
102 distinct values 118 (100.0%) 0 (0.0%)
prickly_AT [numeric]
Mean (sd) : 8.6 (4.7)
min ≤ med ≤ max:
1.4 ≤ 7.8 ≤ 23.3
IQR (CV) : 5.3 (0.5)
103 distinct values 118 (100.0%) 0 (0.0%)

Generated by summarytools 1.1.3 (R version 4.4.2)
2025-04-23

Sensory and metabolite Principal Component Analysis

Metabolites with odour activity values less than one likely play a very minor role in fruit flavour; due to the small ranges and fluctuations, their influence on flavour may be overestimated in modelling. Moreover, sensory terms including prickly aftertaste and metallic aftertaste poorly differentiated the genotypes throughout descriptive analysis. Therefore these variables were removed to focus on highly relevant flavour variables.

# cut VOC variables
no_aroma <- VOCs %>% 
  select(1:6,OAV) %>% 
  pivot_wider(names_from = species, values_from = OAV) %>% 
  select(Hexanal:BITC) %>% 
  summarise(across(where(is.numeric), ~ max(.x, na.rm = TRUE))) %>%
  select(where(function(x) any(x<1))) %>% 
  colnames()

# cut sensory variables
poor_lexicon_use <- c("prickly_AT", "metallic_AT")

Once these variables were removed the data was visualised using PCA to look at sensory and metabolite profiles of each genotype.

# data set up
remove_vars <- c("Glucose_prop","Fructose_prop","Sucrose_prop","label","Genotype2")
PCA_data <- full_data_matrix %>% 
  column_to_rownames(var = "ID") %>% 
  select(-all_of(c(poor_lexicon_use, no_aroma,remove_vars,"Beta_Cyclocitral")))
# Beta_Cyclocitral appears to have very low range and is exacerbating small differences due to presence absence

# PCA script custom bi-plot
PCA_vocs <- PCA(PCA_data, scale.unit = TRUE, 
                ncp = 5, graph = F, 
                quali.sup = 1:31) # omit all but sensory columns
 
# make a list of top variables based on visual assessment
topvars <- PCA_vocs$var$contrib [,1:2] %>% as.data.frame() %>% 
  rownames_to_column(var = "species") %>% mutate (contribution = Dim.1 + Dim.2) %>% 
  top_n(17, contribution)

# make a dataframe with variable coordinates for arrows
PCAvar_coords <- PCA_vocs$var$coord[topvars$species,] %>% 
  as.data.frame() %>% 
  select(1:2) %>% 
  rownames_to_column(var = "var") %>% 
  left_join(sensory_meta) %>% 
  mutate(Category = snakecase::to_title_case(group),
         sensory_label = snakecase::to_sentence_case(sub("_[A-Z]+$", "", var))) 

# create a scale value to the size of the scatter plot to adjust variable coordinates
var <- facto_summarize(PCA_vocs, 
                       element = "var", 
                       result = c("coord", "contrib", "cos2"))

colnames(var)[2:3] <- c("x", "y")

ind <- data.frame(PCA_vocs$ind$coord[, drop = FALSE], 
                  stringsAsFactors = TRUE)

colnames(ind) <- c("x", "y")

r <- min((max(ind[, "x"]) - min(ind[, "x"])/(max(var[, "x"]) - min(var[, "x"]))), 
         (max(ind[, "y"]) - min(ind[, "y"])/(max(var[,"y"]) - min(var[, "y"]))))

# update the variable coordinates

PCAvar_plot_data <- PCAvar_coords %>% 
                    as_tibble() %>% 
                    mutate(across(starts_with("Dim"), ~r*0.6*.x)) # this number scales arrow length

# get coordinates for individuals to map group means with meta data for aesthetics
PCAvoctable <- PCA_vocs$ind$coord %>% 
               as.data.frame() %>% 
               rownames_to_column("ID") %>% 
               left_join(genotype_table)

# create a table with the average coordinates for each phenotype
Mean_coords <- PCAvoctable %>% 
               group_by(label) %>% 
               summarise(across(where(is.numeric), ~mean(.x, na.rm = TRUE))) %>%
               left_join(genotype_table %>% distinct(label,Flesh, Genotype, Genotype2, assay))


# choose colours
my_colours <- c("#E31A1C","gold2")


# Get component variance information for axis labels
percentVar <- PCA_vocs$eig[1:2,2]

#plot PCA coords using ggplot

ggplot(Mean_coords) + 
  aes(x = Dim.1, y = Dim.2, fill = Flesh, shape = assay) + 
  geom_hline(yintercept = 0, alpha = 0.6, 
             linetype="dashed", linewidth=0.5) +
  geom_vline(xintercept = 0, alpha = 0.6, 
             linetype="dashed", linewidth=0.5) +
  geom_segment(mapping = aes(colour=group, xend=Dim.1, yend=Dim.2), 
               data = PCAvar_plot_data,
               inherit.aes = F,
               x=0, y=0, linewidth=0.3,
               arrow = arrow(angle = 30, length = unit(0.2, "cm"))) +
  geom_text_repel(mapping = aes(colour=group, x=Dim.1, y=Dim.2, label=label),
                  data = PCAvar_plot_data,
                  inherit.aes = F,
                  hjust=-0.15, vjust=0.20, size=3.2, box.padding = 0.1) + 
  scale_color_manual(values = c("#8B0000", "#000080", "#228B22", "#708090")) +
  geom_point(alpha = 1, size=2.5) +
  theme_bw() + theme(plot.background = element_blank(), 
                     panel.grid.major = element_blank(), 
                     panel.grid.minor = element_blank(),
                     axis.title = element_text(size=12)) +
  geom_text_repel(label = Mean_coords$Genotype2, 
                  size=3, hjust=-0.2, vjust=0.5, box.padding = 0.2,
                  colour = "black", fontface = "italic") +
   scale_shape_manual(values = c(21,22,24)) +scale_fill_manual(values = my_colours) +
  guides(colour=guide_legend(override.aes = list(label="")))+
  scale_fill_identity(guide = guide_legend(override.aes = list(shape = 21))) +
  labs( x=glue::glue("C1: {round(percentVar[1],2)}% variance"),
        y=glue::glue("C2: {round(percentVar[2],2)}% variance"), fill="Flesh", 
        colour="Attribute", shape = "Havest Month") 

Figure 1: PCA biplot of 27 papaya genotype dipicting the variations in sensory profiles from quantitative descriptive analysis.

PCA_vocs <- PCA(PCA_data, scale.unit = TRUE, 
                ncp = 6, graph = F, 
                quali.sup = c(1:3,31:48)) # all chem PCA

# make a list of top variables based on visual assessment
topvars <- PCA_vocs$var$contrib [,1:2] %>% as.data.frame() %>% 
  rownames_to_column(var = "species") %>% mutate (contribution = Dim.1 + Dim.2) %>% 
  top_n(22, contribution)

# make a dataframe with variable coordinates for arrows
PCAvar_coords <- PCA_vocs$var$coord[topvars$species,] %>% 
  as.data.frame() %>% 
  select(1:2) %>% 
  rownames_to_column(var = "var") %>% 
  left_join(class_key, by = "var") %>% 
  mutate(var=str_replace_all(var, c("Neryl_Geranyl_acetone" = "Neryl/Geranyl Acetone", "E_2_nonal"="E-2-Nonanal","P_cymene"="P-Cymene","_"=" ")), group=str_replace_all(group, c("Acids"="Sugar/acid","Sugars"="Sugar/acid")))

# create a scale value to the size of the scatter plot to adjust variable coordinates
var <- facto_summarize(PCA_vocs, 
                       element = "var", 
                       result = c("coord", "contrib", "cos2"))

colnames(var)[2:3] <- c("x", "y")

ind <- data.frame(PCA_vocs$ind$coord[, drop = FALSE], 
                  stringsAsFactors = TRUE)

colnames(ind) <- c("x", "y")

r <- min((max(ind[, "x"]) - min(ind[, "x"])/(max(var[, "x"]) - min(var[, "x"]))), 
         (max(ind[, "y"]) - min(ind[, "y"])/(max(var[,"y"]) - min(var[, "y"]))))

# update the variable coordinates

PCAvar_plot_data <- PCAvar_coords %>% 
  as_tibble() %>% 
  mutate(across(starts_with("Dim"), ~r*0.6*.x))

# get coordinates for individuals to map group means with meta data for aesthetics
PCAvoctable <- PCA_vocs$ind$coord %>% 
  as.data.frame() %>% 
  rownames_to_column("ID") %>% 
  left_join(genotype_table)

# create a table with the average coordinates for each phenotype
Mean_coords <- PCAvoctable %>% 
  group_by(label,Flesh) %>% 
  summarise(across(where(is.numeric), ~mean(.x, na.rm = TRUE))) %>%
  left_join(genotype_table %>% distinct(label, Flesh, Genotype, Genotype2,assay))

# choose colours
my_colours <- c("#E31A1C","gold2")


# Get component variance information for axis labels
percentVar <- PCA_vocs$eig[1:2,2]

#plot PCA coords using ggplot

ggplot(Mean_coords) + 
  aes(x = Dim.1, y = Dim.2, fill = Flesh, shape = assay) + 
  geom_hline(yintercept = 0, alpha = 0.6, 
             linetype="dashed", linewidth=0.5) +
  geom_vline(xintercept = 0, alpha = 0.6, 
             linetype="dashed", linewidth=0.5) +
  geom_segment(mapping = aes(colour=group, xend=Dim.1, yend=Dim.2), 
               data = PCAvar_plot_data,
               inherit.aes = F,
               x=0, y=0, linewidth=0.3,
               arrow = arrow(angle = 30, length = unit(0.2, "cm"))) +
  geom_text_repel(mapping = aes(colour=group, x=Dim.1, y=Dim.2, label=label),
                  data = PCAvar_plot_data,
                  inherit.aes = F,
                  hjust=0.3, vjust=-0.6, size=3.2, box.padding = 0.1) + 
  scale_color_manual(values = c("#8B0000", "#000080", "#228B22", "#708090","darkorange")) +
  geom_point(alpha = 1, size=2.5) +
  theme_bw() + theme(plot.background = element_blank(), 
                     panel.grid.major = element_blank(), 
                     panel.grid.minor = element_blank(),
                     axis.title = element_text(size=12)) +
  geom_text_repel(label = Mean_coords$Genotype2, 
                  size=3, hjust=1.1, vjust=1.2, box.padding = 0.1,
                  colour = "black", fontface = "italic") +
  scale_shape_manual(values = c(21,22,24)) +scale_fill_manual(values = my_colours) +
  guides(colour=guide_legend(override.aes = list(label="")))+
  scale_fill_identity(guide = guide_legend(override.aes = list(shape = 21))) +
  labs( x=glue::glue("C1: {round(percentVar[1],2)}% variance"),
        y=glue::glue("C2: {round(percentVar[2],2)}% variance"), fill="Flesh", 
        colour="Attribute", shape = "Havest Month")

Figure 2: PCA biplot of 27 papaya genotype dipicting the variations in metabolite profiles from targeted quantitative chemical assays. Including metabolites with concentrations theorectically above an odour threshold of one in one or more genotype.

There appears to be a significant batch effect associated with metabolites with highly skewed distributions. After removing decanal, methyl hexanoate, octanal, (E)-2-nonanal and nonanal, the control genotypes are qualitatively similar in distance to what is observed in the sensory data.

PCA_vocs <- PCA(PCA_data, scale.unit = TRUE,
                ncp = 6, graph = F,
                quali.sup = c(1:3,21,23:26,31:48)) # omit decanal, methyl hexanoate, octanal, (E)-2-nonanal, nonanal

# make a list of top variables based on visual assessment
topvars <- PCA_vocs$var$contrib [,1:2] %>% as.data.frame() %>%
  rownames_to_column(var = "species") %>% mutate (contribution = Dim.1 + Dim.2) %>%
  top_n(22, contribution)
# make a dataframe with variable coordinates for arrows
PCAvar_coords <- PCA_vocs$var$coord[topvars$species,] %>%
  as.data.frame() %>%
  select(1:2) %>%
  rownames_to_column(var = "var") %>%
  left_join(class_key) %>%
  mutate(var=str_replace_all(var, c("Neryl_Geranyl_acetone" = "Neryl/Geranyl Acetone", "E_2_nonal"="E-2-Nonanal","P_cymene"="P-Cymene","_"=" ")), group=str_replace_all(group, c("Acids"="Sugar/acid","Sugars"="Sugar/acid")))

# create a scale value to the size of the scatter plot to adjust variable coordinates
var <- facto_summarize(PCA_vocs,
                       element = "var",
                       result = c("coord", "contrib", "cos2"))

colnames(var)[2:3] <- c("x", "y")

ind <- data.frame(PCA_vocs$ind$coord[, drop = FALSE],
                  stringsAsFactors = TRUE)

colnames(ind) <- c("x", "y")

r <- min((max(ind[, "x"]) - min(ind[, "x"])/(max(var[, "x"]) - min(var[, "x"]))),
         (max(ind[, "y"]) - min(ind[, "y"])/(max(var[,"y"]) - min(var[, "y"]))))

# update the variable coordinates

PCAvar_plot_data <- PCAvar_coords %>%
  as_tibble() %>%
  mutate(across(starts_with("Dim"), ~r*0.6*.x))

# get coordinates for individuals to map group means with meta data for aesthetics
PCAvoctable <- PCA_vocs$ind$coord %>%
  as.data.frame() %>%
  rownames_to_column("ID") %>%
  left_join(genotype_table)

# create a table with the average coordinates for each phenotype
Mean_coords <- PCAvoctable %>%
  group_by(label,Flesh) %>%
  summarise(across(where(is.numeric), ~mean(.x, na.rm = TRUE))) %>%
  left_join(genotype_table %>% distinct(label, Flesh, Genotype, Genotype2,assay))

# choose colours
my_colours <- c("#E31A1C","gold2")


# Get component variance information for axis labels
percentVar <- PCA_vocs$eig[1:2,2]

#plot PCA coords using ggplot

ggplot(Mean_coords) +
  aes(x = Dim.1, y = Dim.2, fill = Flesh, shape = assay) +
  geom_hline(yintercept = 0, alpha = 0.6,
             linetype="dashed", linewidth=0.5) +
  geom_vline(xintercept = 0, alpha = 0.6,
             linetype="dashed", linewidth=0.5) +
  geom_segment(mapping = aes(colour=group, xend=Dim.1, yend=Dim.2),
               data = PCAvar_plot_data,
               inherit.aes = F,
               x=0, y=0, linewidth=0.3,
               arrow = arrow(angle = 30, length = unit(0.2, "cm"))) +
  geom_text_repel(mapping = aes(colour=group, x=Dim.1, y=Dim.2, label=label),
                  data = PCAvar_plot_data,
                  inherit.aes = F,
                  hjust=0.3, vjust=-0.6, size=3.2, box.padding = 0.1) +
  scale_color_manual(values = c("#8B0000", "#000080", "#228B22", "#708090","darkorange")) +
  geom_point(alpha = 1, size=2.5) +
  theme_bw() + theme(plot.background = element_blank(),
                     panel.grid.major = element_blank(),
                     panel.grid.minor = element_blank(),
                     axis.title = element_text(size=12)) +
  geom_text_repel(label = Mean_coords$Genotype2,
                  size=3, hjust=1.1, vjust=1.2, box.padding = 0.1,
                  colour = "black", fontface = "italic") +
  scale_shape_manual(values = c(21,22,24)) +scale_fill_manual(values = my_colours) +
  guides(colour=guide_legend(override.aes = list(label="")))+
  scale_fill_identity(guide = guide_legend(override.aes = list(shape = 21))) +
  labs( x=glue::glue("C1: {round(percentVar[1],2)}% variance"),
        y=glue::glue("C2: {round(percentVar[2],2)}% variance"), fill="Flesh",
        colour="Attribute", shape = "Havest Month")

Figure 3: PCA biplot of 27 papaya genotype dipicting the variations in metabolite profiles from targeted quantitative chemical assays. Including metabolites with concentrations theorectically above an odour threshold of one in one or more genotype. Excluding decanal, methyl hexanoate, octanal, (E)-2-nonanal and nonanal.

Modelling the interactions between sensory and metabolite profiles

Variable influences on traits were assessed using two complementary approaches: (1) generalised boosted regression (Friedman, 2001) to capture both main effects and interactions (x-axis), optimized using 10-fold cross-validation, and (2) Bayesian linear regression (BayesA; Meuwissen et al., 2001) to estimate individual additive effects without interactions (y-axis). Only variables that stood out from the rest were labelled due to space limitations.

scaled_matrix <- PCA_data %>% select(-c(1:3,21,23:26)) %>% scale()

# Sensory subsetting
# column naming
traits <- colnames(scaled_matrix) [23:40]
gbm_name <- c(seq(from = 1, to = 35, by = 2))
beta_name <- c(seq(from = 2, to = 36, by = 2))

set.seed(11)

for (i in 1:18) {
  
  # caret GBM
  carX <- as.matrix(cbind(Y=scaled_matrix[,22+i], scaled_matrix[,1:22]))
  
  grid <- expand.grid(.n.trees = c(500, 750, 1000), 
                      .interaction.depth = c(5,7,10), 
                      .shrinkage = c(0.01,0.02,0.05), 
                      .n.minobsinnode = c(3,5))
  
  trnCtrl <- trainControl(method = "cv", number = 10)
  
  gbm <- train(Y ~ ., data = carX, method = "gbm", trControl = trnCtrl, tuneGrid = grid)
  
  # BGLR
  trnY <- scaled_matrix[,22+i]
  
  ETA <- list(list(X = scaled_matrix[,1:22], model = 'BayesA'))
  
  fit_BA <- BGLR(y = trnY, 
                 ETA = ETA, 
                 nIter = 30000, 
                 burnIn = 10000,
                 thin = 100, 
                 saveAt = '', 
                 df0 = 5, 
                 S0 = NULL,
                 weights = NULL, 
                 R2 = 0.5)

 # write results table
  if(i == 1){
    wts <- as.data.frame(varImp(object = gbm)$importance) %>% 
           cbind(fit_BA$ETA[[1]]$b)
    names(wts)[i] <- paste0(traits[i],"_gbm")
    names(wts)[i+1] <- paste0(traits[i],"_beta")
  } else {
    wts <- cbind(wts,varImp(object = gbm)$importance,fit_BA$ETA[[1]]$b)
    names(wts)[gbm_name[i]] <- paste0(traits[i],"_gbm")
    names(wts)[beta_name[i]] <- paste0(traits[i],"_beta")
  }
}
# Plot prep ------------
bglr_data <- wts %>% 
  select(all_of(beta_name)) %>% 
  rownames_to_column(var = "var") %>% 
  left_join(class_key) %>% 
  pivot_longer(cols = 2:19, values_to = "y", names_to = "trait") %>% 
  mutate(trait=str_replace_all(trait,"_beta",""))

gbm_data <- wts %>% 
  select(all_of(gbm_name)) %>% 
  rownames_to_column(var = "var") %>% 
  left_join(class_key) %>% 
  pivot_longer(cols = 2:19, values_to = "x", names_to = "trait") %>% 
  mutate(trait=str_replace_all(trait,"_gbm",""))

all_plot_data <- left_join(bglr_data,gbm_data[c(1,4:5)],by=c("var" = "var", "trait" = "trait")) %>% 
                mutate(sensory_group = str_sub(trait, - 2, - 1),
                       trait_label = str_sub(trait,0,-4),
                       trait_label = str_replace_all(trait_label,"_"," "),
                       trait_label = str_to_sentence(trait_label),
                       label = str_replace_all(label,c("total sugar"="Total sugar",
                                                       "wet fraction"="Water content")))
# Filter for AR labels -------
ar_label <- 
  rbind(
    all_plot_data %>% filter(sensory_group == "AR") %>% 
      group_by(trait) %>% slice_min(y, n=4),
    
    all_plot_data %>%  filter(sensory_group == "AR") %>% 
      group_by(trait) %>% slice_max(y, n=4),
    
    all_plot_data %>% filter(sensory_group == "AR") %>% 
      group_by(trait) %>% slice_max(x, n=8)) %>%
  
  distinct()
  
my_colours <- c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
                  "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22")

# Need labels to remove
# Determined by initially plotting AR_label 

AR_intensity <- c("Terpinolene","Heptanal","Terpinene","Neryl/Geranyl Acetone","Sucrose",
                  "Hexanal")
AR_citrus <- c("Glucose","Hexanal","Limonene","Sucrose","Heptanal","Sulcatone","Beta ionone","Terpinene")
AR_fishy <- c("Sucrose","Glucose","P-Cymene","pH","Beta ionone","Sulcatone","TA")
AR_floral <- c("Neryl/Geranyl Acetone","Beta ionone", "Hexanal","Sulcatone","Neryl/Geranyl Acetone")
AR_musty <- c("Linalool oxide", "Terpinolene", "GTR","Neryl/Geranyl Acetone")
AR_sweet <- c("BITC","Beta ionone","Linalool oxide","P-Cymene","Hexanal","GTR","pH","Total sugar")

# Filter labels and plot

ar_label1 <- ar_label %>%
  filter(!label %in% AR_intensity & trait=="aroma_intensity_AR"|
         !label %in% AR_citrus & trait == "citrus_AR" |
         !label %in% AR_fishy & trait == "fishy_AR"|
         !label %in% AR_floral & trait == "floral_AR"|
         !label %in% AR_musty & trait == "musty_off_note_AR"|
         !label %in% AR_sweet & trait == "sweet_fruit_AR")
  

#plot AROMA trait variables
ggplot(all_plot_data %>% filter(sensory_group == "AR")) +
  aes(x = x, y = y,
      fill = group,
      colour = group,
      label = label) +

  # Base plot elements
  geom_hline(yintercept = 0,
             alpha = 0.6,
             linetype = "dashed",
             linewidth = 0.5) +
  geom_point(shape = 21,
             alpha = 0.7,
             size = 2.5) +

  # Faceting and coordinates
  facet_wrap(vars(trait_label)) +
  xlim(0, 100) +
  ylim(min(all_plot_data[all_plot_data$sensory_group=="AR",] %>% select(y)), 
       max(all_plot_data[all_plot_data$sensory_group=="AR",] %>% select(y))) +


  # Labels
  labs(x = "Variable importance (GBM)",
       y = expression(beta~"coefficient (BayesA)"),
       fill = "") +

  # Point labels
  geom_text_repel(data = ar_label1,
                  aes(x = x, y = y,
                      label = label,
                      color = group),
                  hjust = -.1,
                  vjust = -0.4,
                  size = 3,
                  box.padding = 0.1) +

  # Colours and guides
  scale_fill_manual(breaks = c("Carotenoid/Terpenes", "Lipid-derived",
                              "Phenylalanine-derived", "Sugars",
                              "Acids", "Other"),
                   values = my_colours) +
  scale_color_manual(breaks = c("Carotenoid/Terpenes", "Lipid-derived",
                               "Phenylalanine-derived", "Sugars",
                               "Acids", "Other"),
                    values = my_colours) +
  guides(colour = "none") +

  # Theme customisation
  theme_bw() +
  theme(
    plot.background = element_blank(),
    panel.background = element_blank(),
    panel.border = element_rect(linetype = "solid", colour = "Black"),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.title = element_text(size = 11),
    axis.title.x = element_text(vjust = -2),
    axis.title.y = element_text(vjust = 2),
    axis.text = element_text(size = 8),
    legend.background = element_blank(),
    strip.background = element_blank(),
    strip.text = element_text(size = 11, face = "bold"),
    panel.spacing.y = unit(0, "cm", data = NULL),
    legend.position = "bottom"
  )

Figure 4: Variation in papaya aroma characteristics explained by targeted metabolites in 27 papaya genotypes.

# Filter for FL labels -------
FL_label <- 
  rbind(
    all_plot_data %>% filter(sensory_group == "FL") %>% 
      group_by(trait) %>% slice_min(y, n=5), 
    all_plot_data %>% filter(sensory_group == "FL") %>% 
      group_by(trait) %>% slice_max(y, n=5),
    all_plot_data %>% filter(sensory_group == "FL") %>% 
      group_by(trait) %>% slice_max(x, n=8)) %>% 
  distinct()

my_colours <- c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
                "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22")

# Need labels to remove
# Determined by initially plotting FL_label 

FL_bitterness <- c("Gamma octalactone", "Sucrose","Hexanal","Neryl/Geranyl Acetone","Beta ionone","Glucose","Terpinolene","Water content","Brix")
FL_intensity <- c("Sulcatone", "Beta ionone","Linalool","TA","BITC")
FL_floral <- c("Brix","Linalool oxide","Neryl/Geranyl Acetone","Terpinene")
FL_musty <- c("Terpinene","Sulcatone")
FL_sweet <- c("Fructose","Linalool oxide","Glucose","GTR", "TA","pH","Gamma otalactone")

# Filter labels and plot

FL_label1 <- FL_label %>%
  filter(!label %in% FL_bitterness & trait=="bitterness_FL"|
           !label %in% FL_intensity & trait == "flavour_intensity_FL" |
           !label %in% FL_floral & trait == "floral_FL"|
           !label %in% FL_musty & trait == "musty_FL"|
           !label %in% FL_sweet & trait == "sweetness_FL")


# Plot flavour trait variables
ggplot(all_plot_data %>% filter(sensory_group == "FL")) +
  aes(x = x, y = y,
      fill = group,
      colour = group,
      label = label) +

  # Base plot elements
  geom_hline(yintercept = 0,
             alpha = 0.6,
             linetype = "dashed",
             linewidth = 0.5) +
  geom_point(shape = 21,
             alpha = 0.7,
             size = 2.5) +

  # Faceting and coordinates
  facet_wrap(vars(trait_label)) +
  xlim(0, 100) +
  ylim(min(all_plot_data %>% filter(sensory_group == "FL") %>% select(y)), 
       max(all_plot_data %>% filter(sensory_group == "FL") %>% select(y))) +

  # Labels
  labs(x = "Variable importance (GBM)",
       y = expression(beta~"coefficient (BayesA)"),
       fill = "") +

  # Point labels
  geom_text_repel(data = FL_label1,
                  aes(x = x, y = y,
                      label = label,
                      color = group),
                  hjust = -0,
                  vjust = -0.5,
                  size = 3,
                  box.padding = 0.1) +

  # Colours and guides
  scale_fill_manual(breaks = c("Carotenoid/Terpenes", "Lipid-derived",
                              "Phenylalanine-derived", "Sugars",
                              "Acids", "Other"),
                   values = my_colours) +
  scale_color_manual(breaks = c("Carotenoid/Terpenes", "Lipid-derived",
                               "Phenylalanine-derived", "Sugars",
                               "Acids", "Other"),
                    values = my_colours) +
  guides(colour = "none") +

  # Theme customisation
  theme_bw() +
  theme(
    plot.background = element_blank(),
    panel.background = element_blank(),
    panel.border = element_rect(linetype = "solid", colour = "Black"),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.title = element_text(size = 11),
    axis.title.x = element_text(vjust = -2),
    axis.title.y = element_text(vjust = 2),
    axis.text = element_text(size = 8),
    legend.background = element_blank(),
    strip.background = element_blank(),
    strip.text = element_text(size = 11, face = "bold"),
    panel.spacing.y = unit(0, "cm", data = NULL),
    legend.position = "bottom"
  )

Figure 5: Variation in papaya taste characteristics explained by targeted metabolites in 27 papaya genotypes.

# First filter for AT labels -------
AT_label <- 
  rbind(
    all_plot_data %>% filter(sensory_group == "AT") %>% 
      group_by(trait) %>% slice_min(y, n=4), 
    all_plot_data %>% filter(sensory_group == "AT") %>% 
      group_by(trait) %>% slice_max(y, n=2),
    all_plot_data %>% filter(sensory_group == "AT") %>% 
      group_by(trait) %>% slice_max(x, n=7)) %>% 
  distinct()

my_colours <- c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
                "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22")

# Plot aftertaste trait variables
ggplot(all_plot_data %>% filter(sensory_group == "AT")) +
  aes(x = x, y = y,
      fill = group,
      colour = group,
      label = label) +

  # Base plot elements
  geom_hline(yintercept = 0,
             alpha = 0.6,
             linetype = "dashed",
             linewidth = 0.5) +
  geom_point(shape = 21,
             alpha = 0.7,
             size = 2.5) +

  # Faceting and coordinates
  facet_wrap(vars(trait_label)) +
  xlim(0, 100) +
  ylim(min(all_plot_data %>% filter(sensory_group == "AT") %>% select(y)), 
       max(all_plot_data %>% filter(sensory_group == "AT") %>% select(y))) +

  # Labels
  labs(x = "Variable importance (GBM)",
       y = expression(beta~"coefficient (BayesA)"),
       fill = "") +

  # Point labels
  geom_text_repel(data = AT_label,
                  aes(x = x, y = y,
                      label = label,
                      color = group),
                  hjust = 0,
                  vjust = 0,
                  size = 3,
                  box.padding = 0.15) +

  # Colours and guides
  scale_fill_manual(breaks = c("Carotenoid/Terpenes", "Lipid-derived",
                              "Phenylalanine-derived", "Sugars",
                              "Acids", "Other"),
                   values = my_colours) +
  scale_color_manual(breaks = c("Carotenoid/Terpenes", "Lipid-derived",
                               "Phenylalanine-derived", "Sugars",
                               "Acids", "Other"),
                    values = my_colours) +
  guides(colour = "none") +

  # Theme customisation
  theme_bw() +
  theme(
    plot.background = element_blank(),
    panel.background = element_blank(),
    panel.border = element_rect(linetype = "solid", colour = "Black"),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.title = element_text(size = 11),
    axis.title.x = element_text(vjust = -2),
    axis.title.y = element_text(vjust = 2),
    axis.text = element_text(size = 8),
    legend.background = element_blank(),
    strip.background = element_blank(),
    strip.text = element_text(size = 11, face = "bold"),
    panel.spacing.y = unit(0, "cm", data = NULL),
    legend.position = "bottom"
  )

Figure 6: Variation in papaya aftertaste characteristics explained by targeted metabolites in 27 papaya genotypes.

#First filter for TX labels -------
  TX_label <- 
  rbind(
    all_plot_data %>% filter(sensory_group == "TX") %>%
      group_by(trait) %>% slice_min(y, n=2), 
    all_plot_data %>% filter(sensory_group == "TX") %>% 
      group_by(trait) %>% slice_max(y, n=2),
    all_plot_data %>% filter(sensory_group == "TX") %>% 
      group_by(trait) %>% slice_max(x, n=2)) %>% 
  distinct()

my_colours <- c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
                "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22")

# Plot TEXTURE trait variables
ggplot(all_plot_data %>% filter(sensory_group == "TX")) +
  aes(x = x, y = y,
      fill = group,
      colour = group,
      label = label) +

  # Base plot elements
  geom_hline(yintercept = 0,
             alpha = 0.6,
             linetype = "dashed",
             linewidth = 0.5) +
  geom_point(shape = 21,
             alpha = 0.7,
             size = 2.5) +

  # Faceting and coordinates
  facet_wrap(vars(trait_label)) +
  xlim(0, 100) +
  ylim(min(all_plot_data %>% filter(sensory_group == "TX") %>% select(y)), 
       max(all_plot_data %>% filter(sensory_group == "TX") %>% select(y))) +
  
  # Labels
  labs(x = "Variable importance (GBM)",
       y = expression(beta~"coefficient (BayesA)"),
       fill = "") +

  # Point labels
  geom_text_repel(data = TX_label,
                  aes(x = x, y = y,
                      label = label,
                      color = group),
                  hjust = 0,
                  vjust = 0,
                  size = 3,
                  box.padding = 0.15) +

  # Colours and guides
  scale_fill_manual(breaks = c("Carotenoid/Terpenes", "Lipid-derived",
                              "Phenylalanine-derived", "Sugars",
                              "Acids", "Other"),
                   values = my_colours) +
  scale_color_manual(breaks = c("Carotenoid/Terpenes", "Lipid-derived",
                               "Phenylalanine-derived", "Sugars",
                               "Acids", "Other"),
                    values = my_colours) +
  guides(colour = "none") +

  # Theme customisation
  theme_bw() +
  theme(
    plot.background = element_blank(),
    panel.background = element_blank(),
    panel.border = element_rect(linetype = "solid", colour = "Black"),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.title = element_text(size = 11),
    axis.title.x = element_text(vjust = -2),
    axis.title.y = element_text(vjust = 2),
    axis.text = element_text(size = 8),
    legend.background = element_blank(),
    strip.background = element_blank(),
    strip.text = element_text(size = 11, face = "bold"),
    panel.spacing.y = unit(0, "cm", data = NULL),
    legend.position = "bottom"
  )

Figure 7: Variation in papaya texture characteristics explained by targeted metabolites in 27 papaya genotypes.

Modelling consumer liking

Trends within the individual consumer liking scores for each genotype were identified using the affinity propagation clustering algorithm (Frey and Dueck (2007)). The grouping of individual consumers into three distinct clusters reveals the interaction between their regular papaya consumption habit and how they scored each genotype during tasting. Moreover, there are indications that multiple genotypes maybe be accepted into the Australian market by different clusters of consumers. ### Cluster analysis

consumer_data <- read_xlsx("data/consumer_data.xlsx")

z <- consumer_data %>% 
  select(participant,liking,var) %>%
  pivot_wider(names_from = var, 
              values_from = liking)

# to plot the eps values
set.seed(11)
a = apcluster(negDistMat(r=2), x=z, q=0)

# cluster exemplars
clusters_eg <- a@exemplars %>% unname()
knitr::kable(consumer_data[clusters_eg,],
      col.names = c("Participant", "Genotype", "Flesh Colour","Liking Score","Like Comment","Dislike Comment","Gender","Age","Consumption Habit"),
      caption = "Cluster exemplars generated using apcluster")

# consumer clusters
c1 <- a@clusters[[1]]%>% unname()
c2 <- a@clusters[[2]]%>% unname()
c3 <- a@clusters[[3]]%>% unname()

consumer_clusters <- z %>% mutate(Cluster = c(1:125))
consumer_clusters$Cluster <- as.character(car::recode(consumer_clusters$Cluster,"c1='1';c2='2';c3='3'"))

# attach cluster membership to consumer data
cluster_data <- 
  left_join(consumer_clusters, 
            consumer_data %>% select(participant,Gender,age,habit) %>% distinct()) %>% 
  relocate(Cluster:habit, .before = RB1) %>% 
  select(-participant)

# Summarise cluster demographic data using fisher test and kruskal
cluster_membership <- cluster_data %>% select(1:4)

tab1 <- CreateTableOne(vars = c("Gender", "age", "habit"),
                       strata = "Cluster",
                       data = cluster_membership,
                       factorVars = c("Gender", "habit"))

# very convoluted way to make it look good in html
tab_df <- 
  as.data.frame(print(tab1)) %>%
  rownames_to_column(var = "Consumer Cluster")

# Create flextable
ft <- flextable(tab_df) %>%
  theme_vanilla() %>%
  autofit()

# summarise cluster liking for each genotype

cluster_summary <- consumer_clusters %>% select(-participant) %>%
  pivot_longer(RB1:`Moonlight 1`,names_to = "var",values_to = "liking")

# ANOVA and Post-hoc 
cluster_aov <- aov(liking ~ var * Cluster, data = cluster_summary)
cluster_ph <- SNK.test(cluster_aov, c("var","Cluster"))

# Extract data for plotting
aov_plot <- 
  left_join(cluster_ph$groups %>% rownames_to_column(var="var"),
            cluster_ph$means %>% rownames_to_column(var="var")) %>% 
  separate(var,
           into = c("var", "cluster"),
           sep = ":")
 
# second y-axis values
lam_breaks <- c(0, 13, 23, 33, 45, 50, 55, 67, 77, 87, 100)
lam_labels <- c('Greatest imaginable dislike',
                'Dislike extremely',
                'Dislike very much',
                'Dislike moderately',
                'Dislike slightly',
                'Neither like nor dislike',
                'Like slightly',
                'Like moderately',
                'Like very much',
                'like extremely',
                'Greatest imaginable like')
ggplot(aov_plot) +
aes(x = reorder(var,-liking), 
    y = liking, 
    group = interaction(var, cluster), 
    color = cluster, fill = cluster) + 
  # Add boxplot-style elements
  geom_boxplot(
    aes(ymin = Min,
        lower = Q25,
        middle = Q50,
        upper = Q75,
        ymax = Max),
    stat = "identity",
    alpha = 0.3,  
    width = 0.9,
    position = position_dodge(0.9)) +
  # Add post-hoc letters
  geom_text(aes(label = groups, y = Q50 + 2),
            position = position_dodge(0.9),
            size = 2.5) +
  # Customize colors
scale_color_manual(values = c("#E69F00", "#56B4E9", "#009E73"),
                   name = 'Consumer\nCluster') +
scale_fill_manual(values = c("#E69F00", "#56B4E9", "#009E73"),
                   name = 'Consumer\nCluster',
                  labels= parse(text = c(
                    "1~(italic(n)==61)",
                    "2~(italic(n)==24)",
                    "3~(italic(n)==40)"))) +
  # remove letter in legend
  guides(color = "none") +
  # Customize theme
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid.minor = element_blank(),
        panel.border = element_rect(fill = NA, color = 'black'),
        legend.position = 'bottom', 
        legend.box = "horizontal",   
        legend.direction = "horizontal",  
        legend.margin = margin(t = 0, b = 10),  
        legend.title = element_text(size = 10),  
        legend.text = element_text(size = 9)) +
  # Add labels
  labs(x = 'Papaya Genotype',
       y = 'Consumer Liking Score (0-100)') +
  # Add secondary y-axis for LAM scale with no expansion
  scale_y_continuous(
    limits = c(0, 100),
    expand = c(0, 0), 
    sec.axis = sec_axis(
      ~.,
      breaks = lam_breaks,
      labels = lam_labels,
      name = 'Labeled Affective Magnitude (LAM) Scale'))

Figure 8: Consumer liking score for nine papaya genotypes recorded by 125 untrained panelists. Each panelist was grouped into a cluster based on the affinity propagation clustering algorithm. There is an interaction between geotype liking scores and cluster membership (ANOVA; p = <0.001). The differences between genotype liking scores among all clusters are indicated by different letters (SNK: p = <0.05).

Table 2: Breakdown of demographic data and papaya consumption habits within each consumer cluster. Kruskal wallis was applied to age and fishers test was applied to gender and papaya consumption habit.

# Display the table
ft

Consumer Cluster

1

2

3

p

test

n

61

24

40

Gender (%)

0.541

Female

43 (70.5)

15 (62.5)

23 (57.5)

Male

17 (27.9)

9 (37.5)

16 (40.0)

Non-binary

1 ( 1.6)

0 ( 0.0)

0 ( 0.0)

Unspecified

0 ( 0.0)

0 ( 0.0)

1 ( 2.5)

age (mean (SD))

36.56 (10.90)

35.92 (12.60)

36.17 (10.70)

0.968

habit (%)

0.002

1-3 times per month

23 (37.7)

3 (12.5)

20 (50.0)

2-4 times a week

3 ( 4.9)

1 ( 4.2)

2 ( 5.0)

never

2 ( 3.3)

8 (33.3)

2 ( 5.0)

once a week

6 ( 9.8)

1 ( 4.2)

2 ( 5.0)

rarely

27 (44.3)

11 (45.8)

14 (35.0)

  # PCA clusters as variables -----
pca_data <- cluster_data %>%
  pivot_longer(RB1:`Moonlight 1`,names_to = "var",values_to = "liking") %>%
  group_by(Cluster,var) %>%
  summarise(mean_liking = mean(liking)) %>%  
  pivot_wider(names_from = "Cluster",values_from = "mean_liking") %>% 
  column_to_rownames(var = "var")

plot_data <- PCA(pca_data, graph = F)
# make a dataframe with variable coordinates for arrows
PCAvar_coords <- plot_data$var$coord %>% 
  as.data.frame() %>% 
  select(1:2) %>% 
  rownames_to_column(var = "var") %>% mutate(var=paste("Cluster",var))

# create a scale value to the size of the scatter plot to adjust variable coordinates
var <- facto_summarize(plot_data, element = "var", 
                       result = c("coord", "contrib", "cos2"))
colnames(var)[2:3] <- c("x", "y")

ind <- data.frame(plot_data$ind$coord[, drop = FALSE], stringsAsFactors = TRUE)
colnames(ind) <- c("x", "y")

r <- min((max(ind[, "x"]) - min(ind[, "x"])/(max(var[, "x"]) - min(var[, "x"]))), 
         (max(ind[, "y"]) - min(ind[, "y"])/(max(var[,"y"]) - min(var[, "y"]))))
# update the variable coordinates
PCAvar_plot_data <- PCAvar_coords %>% as_tibble() %>% 
  mutate(across(starts_with("Dim"), ~r*0.8*.x))

# get coordinates for individuals to map group means with meta data for aesthetics
PCAvoctable <- plot_data$ind$coord %>% 
  as.data.frame() %>% 
  rownames_to_column("Genotype2") %>% 
  left_join(genotype_table %>% select(Genotype2,Flesh) %>% distinct())

# choose colours
my_colours <- c("#E69F00", "#56B4E9", "#009E73")

# Get component variance information for axis labels
percentVar <- plot_data$eig[1:2,2]

#plot PCA coords using ggplot

ggplot(PCAvoctable) + 
  aes(x = Dim.1, y = Dim.2, fill=Flesh) + 
  geom_hline(yintercept = 0, alpha = 0.6, 
             linetype="dashed", linewidth=0.5) +
  geom_vline(xintercept = 0, alpha = 0.6, 
             linetype="dashed", linewidth=0.5) +
  geom_segment(mapping = aes(xend=Dim.1, yend=Dim.2, colour = var), 
               data = PCAvar_plot_data,
               inherit.aes = F,
               x=0, y=0, linewidth=0.5,
               arrow = arrow(angle = 30, length = unit(0.2, "cm")),
               lineend = "round",linejoin = "mitre") +
  geom_text_repel(mapping = aes(x=Dim.1, y=Dim.2, label=var),
                  data = PCAvar_plot_data,
                  inherit.aes = F,
                  hjust=-0.15, vjust=-0.5, size=3.2, colour= my_colours) + 
  scale_fill_manual(values = c("#E31A1C","gold2")) +
  scale_color_manual(values = c("#E69F00", "#56B4E9", "#009E73"),
                   name = 'Consumer\nCluster',
                  labels= parse(text = c(
                    "1~(italic(n)==61)",
                    "2~(italic(n)==24)",
                    "3~(italic(n)==40)"))) +
  geom_point(shape = 21,alpha = 0.9, size=2.5) +
  theme_bw() + theme(plot.background = element_blank(), 
                     panel.grid.major = element_blank(), 
                     panel.grid.minor = element_blank(),
                     axis.title = element_text(size=12),
                     plot.title = element_text(hjust = 0.5)) +
  geom_text_repel(label = PCAvoctable$Genotype2,
                  size=3, hjust=-0.15, vjust=-0.3,
                  colour = "black", fontface = "italic") +
  labs( x=glue::glue("C1: {round(percentVar[1],2)}% variance"),
        y=glue::glue("C2: {round(percentVar[2],2)}% variance"), 
        fill="Flesh Colour", 
        colour="Consumer Cluster") 

Figure 9: Principal component analysis of the liking scores for nine papaya genotypes reported by three distinct consumer clusters (consumer clusters were generated using agglomerative clustering algorithm): Cluster 1, n = 61; Cluster 2, n = 21; Cluster 3, n = 40.

There is an inherent limitation in the dataset whereby fruit included in the consumer study was not able to be used for chemical analysis. This is due to the logistical issues regarding adequate fruit collection and processing: the fruit that was collected was completely consumed by the panelists. To deal with this limitation best linear unbiased estimates were combined with previously collected sensory and metabolite data to represent the broad differences between genotype liking scores. This caveat likely leads to the underestimation of complex nuances between variables that influence papaya fruit liking. Therefore, the model generated from this data highlights larger variable variation between genotypes.

Table 3: Estimated marginal means from the mixed effects model including genotype as a fixed effect and participant as a random effect

liking_fit <- lmer(liking ~ var +
                     (1|participant),
                   data = consumer_data)

blue_df <- 
  as.data.frame(
    emmeans(liking_fit, 
            specs = "var")) %>% 
  rename("Genotype2" = "var") 

kable(blue_df,
      col.names = c("Genotype","Least-squares mean",
                    colnames(blue_df)[3:6]),
      align = c("l",rep("c",5))) %>% 
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  row_spec(0, bold = TRUE)
Genotype Least-squares mean SE df lower.CL upper.CL
1B 39.664 1.66757 607.539 36.3891 42.9389
Hybrid 1 62.720 1.66757 607.539 59.4451 65.9949
Hybrid 6 70.104 1.66757 607.539 66.8291 73.3789
ML3-3-13 50.784 1.66757 607.539 47.5091 54.0589
Moonlight 1 56.832 1.66757 607.539 53.5571 60.1069
RB1 71.888 1.66757 607.539 68.6131 75.1629
RS5 70.368 1.66757 607.539 67.0931 73.6429
Solo 64.888 1.66757 607.539 61.6131 68.1629
Sunlight 1 65.992 1.66757 607.539 62.7171 69.2669
model_data <- 
  left_join(PCA_data %>% rownames_to_column(var = "ID"), genotype_table) %>% 
  left_join(blue_df) %>%
  filter(!is.na(emmean)) %>% 
  column_to_rownames(var = "ID") %>% 
  select(-c(assay:Genotype,SE:upper.CL,label:Genotype2)) %>% 
  relocate(Flesh, .after = "emmean") %>% 
  select(-c(Decanal,Methyl_hexanoate,Octanal,E_2_nonal, Nonanal))

Models indicating variable influence on fruit liking can be produced from red-fleshed and yellow-fleshed data separately; however, with limited power due to sample size. Moreover, the liking scores are more accurately compared as a whole because the consumer participants tasted all samples in one sitting, as opposed to only red-fleshed and then only yellow-fleshed samples.

Interestingly, it became clear in consumer tasting that RS5 (T2-6-5.27.12) wasn’t an inherently bitter genotype; although it was thought to be from earlier sensory panels. Due to the constants of genotype liking, any samples displaying uncharacteristic flavour (likely due to poor fruit maturation) cause significant inaccuracies in modelling efforts. These samples were detected by comparing genotypes with equivalent liking scores. Therefore, samples were removed for further modelling: 03-302 and 03-669.

# find the source of confusion in cosumer liking modelling - filter for RS5/T2-6-5.27.12 and equivalently liked genotypes
sample_review <- full_data_matrix %>% 
  select(-(5:40)) %>% 
  filter(Genotype %in% c("T2-6-5.27.12","RB1","Hybrid_6")) %>% 
  column_to_rownames(var = "ID")

# PCA
pca_data <- PCA(sample_review, quali.sup = 1:4, graph = F)

# make a dataframe with variable coordinates for arrows
PCAvar_coords <- pca_data$var$coord %>% 
  as.data.frame() %>% 
  select(1:2) %>% 
  rownames_to_column(var = "var") %>% 
  slice_max(abs(Dim.1)+abs(Dim.2), n = 10) %>% 
  left_join(sensory_meta) %>% 
  mutate(Category = snakecase::to_title_case(group),
         sensory_label = snakecase::to_sentence_case(sub("_[A-Z]+$", "", var)))

# create a scale value to the size of the scatter plot to adjust variable coordinates
var <- facto_summarize(pca_data, 
                       element = "var", 
                       result = c("coord", "contrib", "cos2"))

colnames(var)[2:3] <- c("x", "y")

ind <- data.frame(pca_data$ind$coord[, drop = FALSE], 
                  stringsAsFactors = TRUE)

colnames(ind) <- c("x", "y")

r <- min((max(ind[, "x"]) - min(ind[, "x"])/(max(var[, "x"]) - min(var[, "x"]))), 
         (max(ind[, "y"]) - min(ind[, "y"])/(max(var[,"y"]) - min(var[, "y"]))))

# update the variable coordinates

PCAvar_plot_data <- PCAvar_coords %>% 
                    as_tibble() %>% 
                    mutate(across(starts_with("Dim"), ~r*0.7*.x)) # this number scales arrow length

# get coordinates for individuals to map group means with meta data for aesthetics
PCAvoctable <- pca_data$ind$coord %>% 
               as.data.frame() %>% 
               rownames_to_column("ID") %>% 
               left_join(genotype_table)

# choose colours
my_colours <- c("Aug22" = "#E69F00", "Feb23" = "#56B4E9", "Feb24" = "#009E73")


# Get component variance information for axis labels
percentVar <- pca_data$eig[1:2,2]

#plot PCA coords using ggplot

ggplot(PCAvoctable) + 
  aes(x = Dim.1, y = Dim.2, fill = assay, shape = Genotype2) + 
  geom_hline(yintercept = 0, alpha = 0.6, 
             linetype="dashed", linewidth=0.5) +
  geom_vline(xintercept = 0, alpha = 0.6, 
             linetype="dashed", linewidth=0.5) +
  geom_segment(mapping = aes(colour=group, xend=Dim.1, yend=Dim.2), 
               data = PCAvar_plot_data,
               inherit.aes = F,
               x=0, y=0, linewidth=0.3,
               arrow = arrow(angle = 30, length = unit(0.2, "cm"))) +
  geom_text_repel(mapping = aes(colour=group, x=Dim.1, y=Dim.2, label=label),
                  data = PCAvar_plot_data,
                  inherit.aes = F,
                  hjust=-0.15, vjust=0.20, size=3.2, box.padding = 0.1) + 
  scale_color_manual(values = c("#8B0000", "#000080", "#228B22", "#708090")) +
  geom_point(alpha = 1, size=2.5,color = "black") +
  xlim(min(PCAvar_plot_data$Dim.1-2.9), max(PCAvar_plot_data$Dim.1)) + 
  ylim(min(PCAvar_plot_data$Dim.2-0.3), max(PCAvar_plot_data$Dim.2)+0.9) +
  theme_bw() + theme(plot.background = element_blank(), 
                     panel.grid.major = element_blank(), 
                     panel.grid.minor = element_blank(),
                     axis.title = element_text(size=12)) +
  geom_text_repel(label = PCAvoctable$ID, 
                  size=3, hjust=-0.2, vjust=0.5, box.padding = 0.2,
                  colour = "black", fontface = "italic") +
   scale_shape_manual(values = c(21,22,24)) + scale_fill_manual(values = my_colours,
                                                                guide = guide_legend(override.aes = list(shape = 22, size = 6))) +
   guides(colour=guide_legend(override.aes = list(label="")))+
  labs( x=glue::glue("C1: {round(percentVar[1],2)}% variance"),
        y=glue::glue("C2: {round(percentVar[2],2)}% variance"), fill="Harvest Month", 
        colour="Attribute", shape = "Genotype") 

# issue with 2 samples during sensory: overly bitter and less sweet than expected
bitter_samples <- c("03-302","03-669")

# update model data
model_data <- model_data[!rownames(model_data) %in% bitter_samples,]

Figure 10: PCA Bi-plot comparing red-fleshed papaya genotypes RB1, Hybrid 6 and RS5. All three genotypes received similar liking scores from 125 untrained panelists.

After these two samples were removed from the dataset, modelling the effect of sensory traits and metabolite traits on papaya liking scores better reflects what was observed. Similar to before gbm and bayesA were implemented to determine how the different variable groups influence papaya liking.

# Function to run models for a specific flesh type
run_models <- function(data, flesh_type, met_cols, sens_cols, y_col = 41) {
  # 
  # Filter and prepare data
  pap <- data %>%
    filter(grepl(flesh_type, Flesh, perl = TRUE)) %>%
    select(-Flesh) %>% 
    scale()

  # Define model parameters based on flesh type
  model_params <- if(flesh_type == "Yellow") {
    list(
      n_trees = c(500),
      int_depth = c(5),
      shrinkage = c(0.01),
      min_obs = c(1),
      cv_folds = 5,
      bglr_iter = 30000,
      bglr_burnin = 10000
    )
  } else {
    list(
      n_trees = c(500, 750, 1000),
      int_depth = c(5, 7, 10),
      shrinkage = c(0.01, 0.02, 0.05),
      min_obs = c(3, 5),
      cv_folds = 10,
      bglr_iter = 30000,
      bglr_burnin = 10000
    )
  }

  # Function to run a single model set (metabolites or sensory)
  run_model_set <- function(cols, type) {
    # Prepare data
    carX <- as.matrix(cbind(Y = pap[,y_col], pap[,cols]))

    # GBM
    grid <- expand.grid(
      .n.trees = model_params$n_trees,
      .interaction.depth = model_params$int_depth,
      .shrinkage = model_params$shrinkage,
      .n.minobsinnode = model_params$min_obs
    )

    gbm_model <- train(
      Y ~ .,
      data = carX,
      method = "gbm",
      trControl = trainControl(method = "cv", number = model_params$cv_folds),
      tuneGrid = grid
    )

    # BGLR
    bglr_model <- BGLR(
      y = pap[,y_col],
      ETA = list(list(X = pap[,cols], model = 'BayesA')),
      nIter = model_params$bglr_iter,
      burnIn = model_params$bglr_burnin,
      thin = 100,
      saveAt = '',
      df0 = 5,
      R2 = 0.5
    )

    # Combine results
    bind_rows(
      varImp(gbm_model)$importance %>%
        rownames_to_column("var") %>%
        mutate(model = paste0(flesh_type, "_", type, "_gbm")),
      tibble(
        var = colnames(pap[,cols]),
        Overall = bglr_model$ETA[[1]]$b,
        model = paste0(flesh_type, "_", type, "_beta")
      )
    ) %>%
      rename(coord = Overall)
  }

  # Run both model sets and combine results
  bind_rows(
    run_model_set(met_cols, "mets"),
    run_model_set(sens_cols, "sens")
  )
}

# run functions
set.seed(11)
# flesh_types <- c("Red", "Yellow", "Red_Yellow")
flesh_types <- c("Red", "Yellow", "Red|Yellow")
met_cols <- 1:22
sens_cols <- 23:40

results <- flesh_types %>%
  map(~run_models(model_data, ., met_cols, sens_cols)) %>%
  list_rbind() %>% 
  select(var, model, coord) %>% 
  mutate(model = sub("|", "_", model, fixed = TRUE))
# liking variable plots
create_importance_plot <- function(data,
                                 data_type = c("mets", "sens"),
                                 variety = "Red & Yellow",
                                 label_data,
                                 color_palette,
                                 group_breaks,
                                 facet_by_variety = FALSE,
                                 facet_scales = "free_y",
                                 facet_ncol = NULL,
                                 facet_theme = NULL) {
  # Input validation
  data_type <- match.arg(data_type)

  # Common theme settings
  plot_theme <- theme_bw() +
    theme(
      plot.background = element_blank(),
      panel.background = element_blank(),
      panel.border = element_rect(linetype = "solid", colour = "Black"),
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      axis.title = element_text(size = 11),
      axis.title.x = element_text(vjust = -2),
      axis.title.y = element_text(vjust = 2),
      axis.text = element_text(size = 8),
      legend.background = element_blank(),
      strip.background = element_blank(),
      strip.text = element_text(size = 11, face = "bold"),
      panel.spacing.y = unit(0, "cm", data = NULL),
      legend.position = "bottom"
    )

  # Filter data based on type and variety
  wts_filtered <- data %>%
    filter(data_type == !!data_type) %>%
    {if(!facet_by_variety) filter(., variety == !!variety) else .}

  # Filter labels
  labels_filtered <- label_data %>%
    filter(data_type == !!data_type) %>%
    {if(!facet_by_variety) filter(., variety == !!variety) else .}

  # Create base plot
  p <- ggplot(wts_filtered) +
    aes(x = x, y = y, fill = group, colour = group, label = label) +
    geom_hline(
      yintercept = 0,
      alpha = 0.6,
      linetype = "dashed",
      linewidth = 0.5
    ) +
    plot_theme +
    xlim(0, 100) +
    {if(!facet_by_variety) ylim(min(wts_filtered$y), max(wts_filtered$y))} +
    geom_point(shape = 21, alpha = 0.7, size = 2.5) +
    geom_text_repel(
      data = labels_filtered,
      aes(x = x, y = y, label = label, color = group),
      hjust = -0.15, vjust = 0, size = 3, box.padding = 0.10
    ) +
    scale_fill_manual(breaks = group_breaks, values = color_palette) +
    scale_color_manual(breaks = group_breaks, values = color_palette) +
    guides(colour = "none") +
    labs(
      x = "Variable importance (GBM)",
      y = "β coefficient (BayesA)",
      fill = ""
    )

  # Add faceting if requested
  if(facet_by_variety) {
    p <- p +
      facet_wrap(~variety,
                 scales = facet_scales,
                 ncol = facet_ncol) +
      {if(!is.null(facet_theme)) facet_theme else
        theme(
          strip.background = element_rect(fill = "white"),
          strip.text = element_text(size = 11, face = "bold")
        )
      }
  }

  return(p)
}

# Prepare the data
wts <- results %>%
  left_join(rbind(class_key, sensory_meta), by = "var") %>%
  mutate(
    variety = case_when(
      str_detect(model, "Red_Yellow") ~ "Red & Yellow",
      str_detect(model, "Red") ~ "Red",
      str_detect(model, "Yellow") ~ "Yellow"
    ),
    data_type = str_extract(model, "mets|sens"),
    model = str_extract(model, "gbm|beta"),
    label = str_replace_all(label,"total sugar","Total sugar")
  ) %>%
  pivot_wider(names_from = model, values_from = coord) %>%
  rename(x = gbm, y = beta)

# Define colors
my_colours <- c(
  "#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
  "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22"
)

# Prepare labels
mets_labels <- rbind(
  wts %>% group_by(variety, data_type) %>% slice_min(y, n = 5),
  wts %>% group_by(variety, data_type) %>% slice_max(y, n = 5),
  wts %>% group_by(variety, data_type) %>% slice_max(x, n = 6)
) %>% distinct()

sens_labels <- rbind(
  wts %>% group_by(variety, data_type) %>% slice_min(y, n = 5),
  wts %>% group_by(variety, data_type) %>% slice_max(y, n = 7),
  wts %>% group_by(variety, data_type) %>% slice_max(x, n = 6)
) %>% distinct()

# Define group breaks
mets_breaks <- c("Carotenoid/Terpenes", "Lipid-derived", "Phenylalanine-derived",
                 "Sugars", "Acids", "Other")
sens_breaks <- c("Flavour", "Aftertaste", "Aroma", "Texture")

# metabolite variable plot
metabolites_plot<- create_importance_plot(
  data = wts,
  data_type = "mets",
  variety = "Red & Yellow",
  label_data = mets_labels,
    color_palette = my_colours,
  group_breaks = mets_breaks,
  facet_by_variety = FALSE
)
# sensory variable plot
Sensory_plot <- create_importance_plot(
  data = wts,
  data_type = "sens",
  variety = "Red & Yellow",
  label_data = sens_labels,
  color_palette = my_colours,
  group_breaks = sens_breaks,
  facet_by_variety = FALSE
)
metabolites_plot

Figure 11: Metabolite characteristics that influence the consumer liking scores for nine papaya genotypes. Variable importance is estimated by gradient boosting machine analysis and Bayesian linear modelling (BayesA) estimates the correlation of each variable with papaya liking scores.

Sensory_plot

Figure 12: Sensory characteristics that influence the consumer liking scores for nine papaya genotypes. Variable importance is estimated by gradient boosting machine analysis and Bayesian linear modelling (BayesA) estimates the correlation of each variable with papaya liking scores.

Variance decomposition analysis is implemented to understand the variation in liking scores between the genotypes based on class groupings for sensory and metabolite variables. Such as taste or aroma for sensory characteristics and terpenes (volatile organic compound) or sugars (non-volatile organic compound) for metabolite variables. This provides a broader sense of the variables influencing fruit liking scores.

# update model data
model_data <- model_data %>% select(-Flesh)

# Function to create Gaussian kernel matrix------------
create_kernel_matrix <- function(data, vars) {
    W <- data %>%
        select(all_of(vars)) %>%
        as.matrix()

    G <- as.matrix(dist(W, method="euclidean", diag=TRUE, upper=TRUE))
    K <- 1 - G/max(G, na.rm=TRUE)
    return(K)
}

# Main variance decomposition function
variance_decomp <- function(data, var_groups, response = "emmean") {
    # Create list to store kernel matrices
    kernel_matrices <- list()

    # Generate kernel matrix for each group
    for(group_name in names(var_groups)) {
        kernel_matrices[[group_name]] <- create_kernel_matrix(
            data,
            var_groups[[group_name]]
        )
    }

    # Create ETA list for BGLR
    ETA <- lapply(kernel_matrices, function(K) {
        list(K = K, model = 'RKHS')
    })

    # Extract response variable
    
    y <- scale(data)[,response]

    # Fit BGLR model
    fm <- BGLR::BGLR(
        y = y,
        ETA = ETA,
        nIter = 30000,
        burnIn = 10000,
        thin = 100,
        R2 = 0.5,
        saveAt = ""
    )

    # Calculate variance components and proportions
    varE <- fm$varE  # Residual variance
    var_components <- sapply(fm$ETA, function(x) x$varU)
    total_var <- sum(var_components) + varE

    # Calculate proportions for each group
    var_proportions <- c(var_components, 
                         residual = varE) / total_var

    # Create results list
    results <- list(
        model = fm,
        variance_components = var_components,
        residual_variance = varE,
        variance_proportions = var_proportions,
        total_variance = total_var
    )

    return(results)
}

# Create list of metabolite variables----------
included_vars <- class_key %>% 
  filter(var %in% colnames(model_data))

phe <- included_vars %>%
    filter(group == "Phenylalanine-derived" & !var == "GTR_conc") %>%
     select(var) %>% pull()
lip <- included_vars %>%
    filter(group == "Lipid-derived") %>%
    select(var) %>% pull()
car <- included_vars %>%
    filter(group == "Carotenoid/Terpenes") %>%
    select(var) %>% pull()
sugar <- included_vars %>%
    filter(group == "Sugars",
           !str_detect(var,"_prop")) %>%
    select(var) %>% pull()
acid <- c("TA")
GTR <- c("GTR_conc")

# Create groups list
mets_groups <- list(
    phe = phe,
    lip = lip,
    car = car,
    sugar = sugar,
    acid = acid,
    GTR = GTR
)

# Metabolites analysis
mets_results <- variance_decomp(model_data, mets_groups, response = "emmean")

# Organize data with row groups
mets_table_data <- 
  data.frame(category = names(mets_results$variance_proportions),
             proportion = mets_results$variance_proportions) %>%
  mutate(proportion = round(proportion * 100, 1),
         group_type = case_when(
           category %in% c("phe", "lip", "car") ~ "Volatiles",
           category %in% c("sugar", "acid", "GTR") ~ "Non-volatiles",
           TRUE ~ "Residual"),
         category = 
           str_replace_all(category,
                           c("phe" = "Phenylalanine-derived",
                             "lip" = "Lipid-derived",
                             "car" = "Carotenoid/Terpenes",
                             "sugar" = "Sugars",
                             "acid" = "Acids",
                             "GTR" = "GTR")))
# write table
mets_table <- mets_table_data %>%
    select(group_type, category, proportion) %>%
    gt(
        groupname_col = "group_type"
    ) %>%
    tab_header(
        title = "Variance Components Analysis",
        subtitle = "Proportion of Variance in Liking Scores Explained by Metabolite Groups"
    ) %>%
    fmt_percent(
        columns = proportion,
        decimals = 1,
        scale_values = FALSE
    ) %>%
    cols_label(
        category = "Metabolite Group",
        proportion = "Variance Explained (%)"
    ) %>%
    summary_rows(
        groups = c("Non-volatiles", "Residual", "Volatiles"),
        columns = proportion,
        fns = list("Group Total" = ~sum(.)),
        fmt = ~ fmt_percent(., decimals = 1, scale_values = FALSE)
    ) %>%
    grand_summary_rows(
        columns = proportion,
        fns = list("Total" = ~sum(.)),
        fmt = ~ fmt_percent(., decimals = 1, scale_values = FALSE)
    ) %>%
    tab_style(
        style = cell_borders(
            sides = "bottom",
            weight = px(2)
        ),
        locations = cells_row_groups()
    ) %>%
    opt_row_striping() %>%
    tab_options(
        row_group.background.color = "gray90",
        table.border.top.width = px(3),
        table.border.bottom.width = px(3)
    )

# Create list of Sensory variables -------
include_vars <- sensory_meta %>% 
  filter(var %in% colnames(model_data))
AR <- include_vars %>%
    filter(group == "Aroma") %>%
    select(var) %>% pull()
TX <- include_vars %>%
    filter(group == "Texture") %>%
    select(var) %>% pull()
FL <- include_vars %>%
    filter(group == "Flavour") %>%
    select(var) %>% pull()
AT <- include_vars %>%
    filter(group == "Aftertaste",
           !str_detect(var,"_prop")) %>%
    select(var) %>% pull()

# Create groups list
sens_groups <- list(
    Aroma = AR,
    Texture = TX,
    Flavour = FL,
    Aftertaste = AT)

# Sensory analysis
sens_results <- variance_decomp(model_data, sens_groups, response = "emmean")

# Organize data with row groups
sens_table_data <- 
  data.frame(category = names(sens_results$variance_proportions),
             proportion = sens_results$variance_proportions) %>%
  mutate(proportion = round(proportion * 100, 1),
         category = str_replace(category,'residual', 'Residual'))
# write table
sens_table <- sens_table_data %>%
    gt() %>%
    tab_header(
        title = "Variance Components Analysis",
        subtitle = "Proportion of Variance in Liking Scores Explained by Sensory Groups"
    ) %>%
    fmt_percent(
        columns = proportion,
        decimals = 1,
        scale_values = FALSE
    ) %>%
    cols_label(
        category = "Sensory Group",
        proportion = "Variance Explained (%)"
    ) %>%
    tab_style(
        style = cell_borders(
            sides = "bottom",
            weight = px(2)
        ),
        locations = cells_row_groups()
    ) %>%
    opt_row_striping() %>%
    tab_options(
        row_group.background.color = "gray90",
        table.border.top.width = px(3),
        table.border.bottom.width = px(3)
    )

Table 4

sens_table
Variance Components Analysis
Proportion of Variance in Liking Scores Explained by Sensory Groups
Sensory Group Variance Explained (%)
Aroma 31.6%
Texture 16.7%
Flavour 17.1%
Aftertaste 16.2%
Residual 18.4%

Table 5

mets_table
Variance Components Analysis
Proportion of Variance in Liking Scores Explained by Metabolite Groups
Metabolite Group Variance Explained (%)
Volatiles
Phenylalanine-derived 6.4%
Lipid-derived 50.6%
Carotenoid/Terpenes 7.5%
Group Total 64.5%
Non-volatiles
Sugars 6.7%
Acids 4.1%
GTR 7.1%
Group Total 17.9%
Residual
residual 17.5%
Group Total 17.5%
Total 99.9%

Finally a predictive model was generated to predict papaya fruit liking in future selection processes. The variables included in the model were reduced to four to prevent overfitting considering the limited sample size: sulcatone (terpene), linalool oxide (terpene), gamma octalactone (lipid-derived) and total sugars (sucrose, glucose and fructose; sugars). This model may be implemented for selection purposes if the sampling and instrumentation protocols are repeated. New samples may be indexed within the scores of samples used to build the model. Considering the model isn’t equipped for extrapolation, any values outside the ranges of what was used to build the model may distort the predicted liking score. Unscaled data was input to the model algorithm to make it more accessible for new data imput.

# Create a list of variable combinations to test
variable_combinations <- list(
  "Model 1" = c("Sulcatone", "Brix", "Gamma_octalactone", "Hexanal"),
  "Model 2" = c("Sulcatone", "total_sugar", "Gamma_octalactone", "Linalool_oxide"),
  "Model 3" = c("Sulcatone", "Neryl_Geranyl_acetone", "P_cymene", "Brix"),
  "Model 4" = c("Sulcatone", "Neryl_Geranyl_acetone", "P_cymene", "total_sugar"),
  "Model 5" = c("Sulcatone", "total_sugar", "BITC", "TA"),
  "Model 6" = c("Sulcatone", "Brix", "BITC", "TA"),
  "Model 7" = c("Sulcatone", "Brix", "Gamma_octalactone", "BITC"),
  "Model 8" = c("Sulcatone", "total_sugar", "Gamma_octalactone", "BITC"),
  "Model 9" = c("Sulcatone", "total_sugar", "Gamma_octalactone", "TA"),
  "Model 10" = c("Sulcatone", "Brix", "Gamma_octalactone", "TA")
)

# Function to run 5-fold cross-validation using hyperparameter grid search
run_xgb_cv <- function(params, dtrain, nrounds = 500, nfold = 5, early_stopping = 20) {
    cv <- xgb.cv(
        params = params,
        data = dtrain,
        nrounds = nrounds,
        nfold = nfold,
        early_stopping_rounds = early_stopping,
        verbose = FALSE,
        metrics = "rmse"
    )

    return(list(
        rmse = min(cv$evaluation_log$test_rmse_mean),
        iteration = cv$best_iteration
    ))
}

# Grid search function
grid_search_xgb <- function(param_grid, dtrain) {
    grid <- expand.grid(param_grid)

    results <- apply(grid, 1, function(params) {
        param_list <- list(
            objective = "reg:squarederror",
            eta = params["eta"],
            max_depth = params["max_depth"],
            subsample = params["subsample"],
            colsample_bytree = params["colsample_bytree"]
        )

        cv_result <- run_xgb_cv(param_list, dtrain)

        return(c(
            params,
            best_rmse = cv_result$rmse,
            best_iteration = cv_result$iteration
        ))
    })

    results_df <- as.data.frame(t(results))
    return(results_df)
}

# Function to prepare data for a specific variable combination
prepare_data <- function(data, vars) {
  boost_data <- data %>% select(emmean, all_of(vars))

  set.seed(11)  # Ensure reproducibility
  train_index <- sample(1:nrow(boost_data), 0.8 * nrow(boost_data))
  train_data <- boost_data[train_index, ]
  test_data <- boost_data[-train_index, ]

  list(
    dtrain = xgb.DMatrix(data = as.matrix(train_data[, -1]), label = train_data[, 1]),
    dtest = xgb.DMatrix(data = as.matrix(test_data[, -1]), label = test_data[, 1]),
    test_data = test_data
  )
}

# Modified grid search function to return best parameters and performance metrics
run_model_combination <- function(data, vars) {
  # Prepare data
  prepared_data <- prepare_data(data, vars)

  # Grid search parameters
  param_grid <- list(
    eta = c(0.01, 0.05, 0.1),
    max_depth = c(2, 3, 4),
    subsample = c(0.8, 0.9, 1.0),
    colsample_bytree = c(0.7, 0.85, 1.0)
  )
 
  # Run grid search
  results <- grid_search_xgb(param_grid, prepared_data$dtrain)
  best_params <- results[which.min(results$best_rmse), ]

  # Train final model with best parameters
  xgb_model <- xgboost(
    data = prepared_data$dtrain,
    nrounds = best_params$best_iteration,
    eta = best_params$eta,
    max_depth = best_params$max_depth,
    subsample = best_params$subsample,
    colsample_bytree = best_params$colsample_bytree,
    objective = "reg:squarederror",
    verbose = 0  # Suppress output
  )

  # Make predictions
  predictions <- predict(xgb_model, newdata = prepared_data$dtest)

  # Calculate performance metrics
  metrics <- postResample(pred = predictions, obs = prepared_data$test_data[, "emmean"])

  # Return results
  list(
    variables = paste(vars, collapse = ", "),
    RMSE = metrics["RMSE"],
    Rsquared = metrics["Rsquared"],
    MAE = metrics["MAE"],
    nrounds = best_params$best_iteration,
    eta = best_params$eta,
    max_depth = best_params$max_depth,
    subsample = best_params$subsample,
    colsample_bytree = best_params$colsample_bytree
  )
}

# Model comparison function
model_comparison <- function(data, variable_combinations) {
  # Run models for all combinations
  results <- imap(variable_combinations, \(vars, name) {
    message(sprintf("Processing %s with variables: %s", name, paste(vars, collapse=", ")))
    result <- run_model_combination(data, vars)
    tibble(
      Model_ID = name,  # Directly use the name from imap
      Variables = result$variables,
      RMSE = result$RMSE,
      R_squared = result$Rsquared,
      MAE = result$MAE,
      nrounds = result$best_iteration,
      eta = result$eta,
      max_depth = result$max_depth,
      subsample = result$subsample,
      colsample_bytree = result$colsample_bytree
    )
  }) %>%
    bind_rows()

  return(results)
}

# Run the comparison
model_results <- model_comparison(model_data, variable_combinations)

Table 6: Comparison of XGBoost performance metrics using different variables associated with papaya liking

# Comparison table
model_results %>%
  arrange(RMSE) %>% 
  mutate(across(where(is.numeric), ~ round(.x, digits = 4))) %>%
  datatable()
# Function to run 5-fold cross-validation using hyperparameter grid search
run_xgb_cv <- function(params, dtrain, nrounds = 500, nfold = 5, early_stopping = 50) {
  cv <- xgb.cv(
    params = params,
    data = dtrain,
    nrounds = nrounds,
    nfold = nfold,
    early_stopping_rounds = early_stopping,
    verbose = FALSE,
    metrics = "rmse"
  )

  return(list(
    rmse = min(cv$evaluation_log$test_rmse_mean),
    iteration = cv$best_iteration
  ))
}

grid_search_xgb <- function(param_grid, dtrain) {
  # Create grid of all combinations
  grid <- expand.grid(param_grid)

  # Apply CV to each parameter combination
  results <- apply(grid, 1, function(params) {
    param_list <- list(
      objective = "reg:squarederror",
      eta = params["eta"],
      max_depth = params["max_depth"],
      subsample = params["subsample"],
      colsample_bytree = params["colsample_bytree"]#,
      # min_child_weight = params["min_child_weight"],
      # gamma = params["gamma"] 
    )

    cv_result <- run_xgb_cv(param_list, dtrain)

    return(c(
      params,
      best_rmse = cv_result$rmse,
      best_iteration = cv_result$iteration
    ))
  })

  # Convert results to dataframe
  results_df <- as.data.frame(t(results))

  return(results_df)
}

# Data set up
boost_data <- model_data %>% select(emmean, Sulcatone, total_sugar, Gamma_octalactone, Linalool_oxide)

set.seed(11)
train_index <- sample(1:nrow(boost_data), 0.8 * nrow(boost_data))
train_data <- boost_data [train_index,]
test_data <- boost_data [-train_index,]
dtrain <- xgb.DMatrix(data = as.matrix(train_data[, -1]), label = train_data[, 1])
dtest <- xgb.DMatrix(data = as.matrix(test_data[, -1]), label = test_data[, 1])
colnames(dtrain) <- c("Sulcatone", "total_sugar", "Gamma_octalactone", "Linalool_oxide")
colnames(dtest) <- c("Sulcatone", "total_sugar", "Gamma_octalactone", "Linalool_oxide")

#  Grid searches

param_grid <- list(
    eta = seq(0.02,0.05,0.01),
    max_depth = c(4),
    subsample = c(0.8),
    colsample_bytree = c(0.85)#,
    # min_child_weight = c(1, 3),
    # gamma = c(0, 0.1) 
)

# Grid search results
results <- grid_search_xgb(param_grid, dtrain)
best_params <- results[which.min(results$best_rmse),]
# run model with best parameters.
set.seed(11)
xgb_model <- xgboost(data = dtrain,
                     nrounds = 497,
                     eta = 0.04,
                     max_depth = 4,
                     subsample = 0.8,
                     colsample_bytree = 0.85,
                     verbose = F,
                     objective= "reg:squarederror")
# Make predictions on test set
predictions <- predict(xgb_model, newdata = dtest)

Table 7: XGBoost performance metrics

# Model evaluation
postResample(pred = predictions, obs = test_data[,"emmean"]) %>%
  as.data.frame() %>% rownames_to_column(var="Metric") %>% gt()
Metric .
RMSE 2.4230439
Rsquared 0.9475259
MAE 1.8311825
# Create a scatter plot of predicted vs actual values for test data
ggplot(data.frame(Predicted = predictions, 
                        Actual = test_data[,"emmean"]), 
             aes(x = Actual, y = Predicted)) + 
  stat_poly_line(color = "firebrick4", fill = "red3") + 
  stat_poly_eq() + theme_tufte() + 
  geom_point(color = "gray15", alpha = .5, shape = 20, size = 3.5) +
  labs( x="Observed Liking", y="Predicted Liking", title = "Predicted Liking ~ Observed Liking") +
  theme(plot.title =  element_text(hjust = 0.5), axis.title = element_text(size=14),title = element_text(size = 14))

Figure 13: XGBoost model predicted liking vs observed liking based on concentrations of sulcatone, gamma octalactone, linalool oxide and total sugar.

# Feature Importance
importance_matrix <- xgb.importance(feature_names = colnames(train_data[,-1]),
                                    model = xgb_model)
xgb.plot.importance(importance_matrix,
                    main = "Feature Importance Plot")

Figure 14: XGBoost variable importance

Table 8:

importance_matrix %>% mutate(Feature = str_replace_all(Feature,c("_"=" ","total"="Total"))) %>% 
    gt() %>%
    tab_header(
        title = "XGBoost Feature importance",
        subtitle = "Features contributing to liking"
    ) %>% 
    fmt_percent(
        columns = everything(),
        decimals = 1, 
        scale_values = TRUE
    )  %>%
    opt_row_striping() %>%
    tab_options(
        row_group.background.color = "gray90",
        table.border.top.width = px(3),
        table.border.bottom.width = px(3)
    )
XGBoost Feature importance
Features contributing to liking
Feature Gain Cover Frequency Importance
Sulcatone 63.7% 37.3% 34.7% 63.7%
Gamma octalactone 25.1% 18.2% 19.4% 25.1%
Total sugar 6.6% 22.4% 27.1% 6.6%
Linalool oxide 4.7% 22.0% 18.8% 4.7%

Data Summary

The summarised volatile concentrations are reported below

Table 9: Summary of VOCs across 27 papaya genotypes. Sugar values reported in g/100gFW; total soluble solids (brix) reported in °Brix; titratable acids reported in %CA; VOCs are reported in ppb; GTR reported in ppm.

(#tab:VOC summary)Summary of VOC concentrations (ppb) by chemical species
Species CAS Number RT (min) RI Chemical Class n Mean SD SE Median Min Max N detected % detected
Hexanal 66-25-1 22.22 809 Aldehyde 118 18.38 21.25 1.96 9.84 0.03 85.25 118 100.00
Heptanal 111-71-7 26.91 912 Aldehyde 118 32.05 24.52 2.26 23.41 0.00 112.15 117 99.15
Methyl_hexanoate 106-70-7 27.69 929 Ester 118 42.84 199.13 18.33 0.00 0.00 2002.82 49 41.53
Benzaldehyde 100-52-7 30.3 988 Aldehyde 118 14.06 14.68 1.35 7.81 1.77 74.75 118 100.00
Sulcatone 110-93-0 30.53 993 Ketone 118 62.88 84.21 7.75 46.03 0.00 629.09 117 99.15
Octanal 124-13-0 31.43 1013 Aldehyde 118 7.05 5.93 0.55 5.22 0.00 33.49 117 99.15
P_cymene 99-87-6 32.89 1047 Terpene 118 5.20 3.25 0.30 5.11 0.00 20.54 112 94.92
Limonene 5989-27-5 33.15 1054 Terpene 118 26.41 39.93 3.68 12.14 0.00 250.14 87 73.73
Phenylacetaldehyde 122-78-1 33.70 1067 Aldehyde 118 0.00 0.00 0.00 0.00 0.00 0.00 0 0.00
Terpinene 99-85-4 34.22 1079 Terpene 118 10.53 15.37 1.41 5.92 0.00 105.47 105 88.98
Linalool_oxide 34995-77-2 34.77/35.39 1092/1108 Terpene 118 52682.49 108929.32 10027.76 7529.00 0.00 747850.87 116 98.31
Linalool 78-70-6 35.56 1112 Terpene 118 7313.67 18543.25 1707.04 682.34 0.00 112060.59 116 98.31
Nonanal 124-19-6 35.65 1114 Aldehyde 118 9.50 7.91 0.73 8.09 0.43 36.41 118 100.00
Methyl_octanoate 111-11-5 36.14 1126 Ester 118 2.21 3.80 0.35 1.38 0.00 20.87 99 83.90
E_2_nonal 18829-56-6 37.94 1172 Aldehyde 118 0.53 0.88 0.08 0.00 0.00 5.13 43 36.44
Decanal 112-31-2 39.59 1216 Aldehyde 118 0.39 1.20 0.11 0.00 0.00 7.83 18 15.25
Beta_Cyclocitral 432-25-7 40.98 1255 Terpene 118 2.24 2.94 0.27 0.00 0.00 10.01 48 40.68
Gamma_octalactone 104-50-7 41.85 1280 Ester 118 11.38 13.69 1.26 6.45 1.17 63.22 118 100.00
BITC 622-78-6 46.21 1414 Sulfur 118 4.53 7.13 0.66 2.62 0.00 67.27 91 77.12
Neryl_Geranyl_acetone 3796-70-1 48.11 1478 Terpene 118 39.89 42.79 3.94 32.89 2.65 312.72 118 100.00
Beta_ionone 79-77-6 49.4 1524 Terpene 118 6.81 4.93 0.45 5.26 3.47 50.19 118 100.00

Table 10: Variable means for 27 papaya genotypes (standard deviation)


General information

This document was last updated at 2025-04-23 09:02:20.598379 using R Markdown (built with R version 4.4.2 (2024-10-31 ucrt)). The source code for this website can be found on https://github.com/JoshLomax/Papaya_sensory_metabolomics.

Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. It is especially powerful at authoring documents and reports which include code and can execute code and use the results in the output. For more details on using R Markdown see http://rmarkdown.rstudio.com, R Markdown: The Definitive Guide and Rmarkdown cheatsheet.

Bibliography

Brewer S, Plotto A, Bai J, et al. (2021) Evaluation of 21 papaya (carica papaya l.) accessions in southern florida for fruit quality, aroma, plant height, and yield components. Scientia Horticulturae 288:110387. doi: 10.1016/j.scienta.2021.110387
Colantonio V, Ferrão LFV, Tieman DM, et al. (2022) Metabolomic selection for enhanced fruit flavor. Proceedings of the National Academy of Sciences of the United States of America 119:e2115865119. doi: https://doi.org/10.1073/pnas.2115865119
Lomax J, Ford R, Bar I (2024) Multi-omic applications for understanding and enhancing tropical fruit flavour. Plant Molecular Biology 2024 114:4 114:1–17. doi: 10.1007/S11103-024-01480-7
Pico J, Gerbrandt EM, Castellarin SD (2022) Optimization and validation of a SPME-GC/MS method for the determination of volatile compounds, including enantiomeric analysis, in northern highbush blueberries (vaccinium corymbosum l.). Food Chemistry 368:130812. doi: 10.1016/J.FOODCHEM.2021.130812
Zhou Z, Bar I, Ford R, et al. (2022) Biochemical, sensory, and molecular evaluation of flavour and consumer acceptability in australian papaya (carica papaya l.) varieties. International Journal of Molecular Sciences 23:6313. doi: 10.3390/IJMS23116313